feat(proxy): serve Anthropic-native /v1/models for Claude Code gateway discovery #30273

Merged
Sameerlite merged 3 commits into BerriAI:litellm_oss_170626_1 from Ar-maan05:feat-anthropic-native-v1-models on Jun 17, 2026

@Ar-maan05 Ar-maan05 commented Jun 12, 2026

Contributor

Relevant issues

Fixes #27180

Pre-Submission checklist

  • I have added meaningful tests
  • My PR passes all unit tests on the affected directory
  • My PR's scope is as isolated as possible; it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Type

🆕 New Feature

Changes

Claude Code 2.1.126+ added gateway model discovery: when ANTHROPIC_BASE_URL points at a gateway, it queries {base_url}/v1/models at startup and populates the /model picker with the discovered models. That discovery only parses the Anthropic-native Models API shape, so against litellm (which returns OpenAI's {id, object, created, owned_by} list) Claude Code finds nothing and the picker stays empty, even though /v1/messages already works.

This serves the Anthropic-native shape from the same /v1/models route via content negotiation on the anthropic-version header. Claude Code already sends that header for /v1/messages, so when it is present the endpoint returns the Anthropic Models envelope (type / display_name / created_at per entry, plus top-level has_more / first_id / last_id); otherwise the response is byte-for-byte the existing OpenAI shape, so aider and other OpenAI-compatible clients are unaffected. A separate endpoint was not used because Claude Code discovers at the gateway root's /v1/models, and a global config flag would break the OpenAI clients that share the route.
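
The negotiation described above comes down to a single header check. A minimal sketch, assuming the handler can see the request headers as a mapping; the function body is illustrative and is not the merged implementation:

```python
from typing import Mapping

# Header name taken from the PR description. HTTP header names are
# case-insensitive, so the comparison lowercases each name.
ANTHROPIC_VERSION_HEADER = "anthropic-version"


def wants_anthropic_format(headers: Mapping[str, str]) -> bool:
    """Return True when the caller sent an anthropic-version header.

    Hypothetical helper for illustration only.
    """
    return any(name.lower() == ANTHROPIC_VERSION_HEADER for name in headers)


# No header: the caller keeps getting the OpenAI-shaped list.
print(wants_anthropic_format({"x-api-key": "sk-1234"}))  # False
# Header present (what Claude Code sends): Anthropic-native envelope.
print(wants_anthropic_format({"Anthropic-Version": "2023-06-01"}))  # True
```

Only the presence of the header is checked here, not its value, which matches the description above; whether the merged code also inspects the value is not stated in this PR.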

The full model list is returned and Claude Code applies its own claude/anthropic id-prefix filter client-side, so no server-side filtering is imposed (a model aliased to claude-* that points at any backend still shows up, which is the point for gateway users). display_name falls back to the model id, the stable label a gateway can offer for arbitrary upstream models, and created_at is the ISO 8601 form of the same timestamp the OpenAI shape already returns. Hidden/unhealthy models are filtered before formatting in both the normal and scope=expand branches, exactly as for the OpenAI shape.
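
The envelope rules in the paragraph above can be sketched as a small formatter. This is an illustration under stated assumptions, not the body of create_anthropic_model_list_response: the function name build_anthropic_model_list and the DEFAULT_CREATED constant are hypothetical, and the timestamp value is the one visible in the OpenAI-shaped output in this PR.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

# Placeholder timestamp copied from the OpenAI-shaped output in this PR.
DEFAULT_CREATED = 1677610602


def build_anthropic_model_list(
    model_ids: List[str], created: int = DEFAULT_CREATED
) -> Dict[str, Any]:
    """Build an Anthropic-style model list envelope (illustrative sketch)."""
    # ISO 8601 in UTC with a literal Z suffix instead of +00:00.
    created_at = datetime.fromtimestamp(created, tz=timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%SZ"
    )
    data = [
        # display_name falls back to the model id.
        {"type": "model", "id": m, "display_name": m, "created_at": created_at}
        for m in model_ids
    ]
    first_id: Optional[str] = model_ids[0] if model_ids else None
    last_id: Optional[str] = model_ids[-1] if model_ids else None
    return {"data": data, "has_more": False, "first_id": first_id, "last_id": last_id}


envelope = build_anthropic_model_list(["claude-opus-4-6", "gpt-4o"])
print(envelope["data"][0]["created_at"])  # 2023-02-28T18:56:42Z
print(build_anthropic_model_list([])["first_id"])  # None
```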

Tests

In tests/test_litellm/llms/anthropic/test_anthropic_common_utils.py, two unit tests pin create_anthropic_model_list_response: the full envelope (per-entry type/display_name/created_at with a Z-suffixed ISO timestamp, top-level has_more/first_id/last_id, no object) and the empty-list case (first_id/last_id null). In tests/test_litellm/proxy/proxy_server/test_routes_models.py, a route test drives GET /v1/models (and /models) with the anthropic-version header and asserts the negotiated Anthropic shape, while the existing happy-path test pins that the default response stays OpenAI. Mutation-checked by hand: forcing the negotiation off, emitting object instead of type, and dropping the Z normalization each fail a test. The affected suites are green.

Screenshots / Proof of Fix

Live proxy on localhost:4000 with two claude-* deployments and one gpt-4o.

python litellm/proxy/proxy_cli.py --config litellm/proxy/dev_config_27180.yaml --port 4000

Default request (no header): OpenAI format, unchanged.

curl -s http://127.0.0.1:4000/v1/models -H "x-api-key: sk-1234" | python3 -m json.tool
{
    "data": [
        {"id": "claude-opus-4-6", "object": "model", "created": 1677610602, "owned_by": "openai"},
        {"id": "claude-haiku-4-5", "object": "model", "created": 1677610602, "owned_by": "openai"},
        {"id": "gpt-4o", "object": "model", "created": 1677610602, "owned_by": "openai"}
    ],
    "object": "list"
}

With the anthropic-version header (what Claude Code sends): Anthropic-native format.

curl -s http://127.0.0.1:4000/v1/models -H "x-api-key: sk-1234" -H "anthropic-version: 2023-06-01" | python3 -m json.tool
{
    "data": [
        {
            "type": "model",
            "id": "claude-opus-4-6",
            "display_name": "claude-opus-4-6",
            "created_at": "2023-02-28T18:56:42Z"
        },
        {
            "type": "model",
            "id": "claude-haiku-4-5",
            "display_name": "claude-haiku-4-5",
            "created_at": "2023-02-28T18:56:42Z"
        },
        {
            "type": "model",
            "id": "gpt-4o",
            "display_name": "gpt-4o",
            "created_at": "2023-02-28T18:56:42Z"
        }
    ],
    "has_more": false,
    "first_id": "claude-opus-4-6",
    "last_id": "gpt-4o"
}

The shape matches the Anthropic Models API. All models are returned and Claude Code keeps the claude/anthropic-prefixed ones for the picker; clients that do not send anthropic-version get the unchanged OpenAI list.

greptile-apps Bot commented Jun 12, 2026

Contributor

Greptile Summary

This PR adds Anthropic-native model list content negotiation to the existing /v1/models endpoint: when the anthropic-version request header is present (as Claude Code sends for all Anthropic API calls), the endpoint returns the Anthropic Models API envelope (type/display_name/created_at per entry, plus has_more/first_id/last_id); all other callers continue to receive the unchanged OpenAI-compatible shape.

  • create_anthropic_model_list_response is correctly placed in litellm/llms/anthropic/common_utils.py, satisfying the rule against provider-specific code outside of llms/; it reuses the existing DEFAULT_MODEL_CREATED_AT_TIME constant and produces a Z-suffixed ISO 8601 timestamp consistent with the Anthropic Models API.
  • model_list in proxy_server.py gains a request: Request = None injection and a single wants_anthropic_format flag checked in both the scope=expand and normal branches — the existing OpenAI response path is untouched when the header is absent, so no backwards-incompatible change is introduced.
  • Tests cover the full envelope shape, the empty-list case, and both affected HTTP routes, all fully mocked with no real network calls.

Confidence Score: 5/5

Safe to merge — the change is additive and strictly backward-compatible; callers without the anthropic-version header receive a byte-for-byte identical response to today.

The implementation is a clean, additive content-negotiation layer. Provider-specific formatting logic is correctly placed in the llms/anthropic/ module rather than shared proxy utilities. The request: Request = None default guards all non-HTTP call sites. Both affected branches in model_list handle the new flag consistently, tests are fully mocked, and the existing OpenAI-format path is untouched.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/llms/anthropic/common_utils.py | Adds create_anthropic_model_list_response helper that builds the Anthropic-native /v1/models envelope; uses the existing DEFAULT_MODEL_CREATED_AT_TIME constant and ISO 8601 Z-suffix formatting. No fastapi imports, correctly placed in the provider-specific module. |
| litellm/proxy/proxy_server.py | Injects request: Request = None into model_list and performs content negotiation on the anthropic-version header in both the scope=expand and normal branches. Existing OpenAI-format response is byte-for-byte unchanged when the header is absent. |
| tests/test_litellm/llms/anthropic/test_anthropic_common_utils.py | Adds two unit tests for create_anthropic_model_list_response: full envelope shape (type/display_name/created_at with Z suffix, has_more/first_id/last_id, no object) and empty-list case. All mocked, no network calls. |
| tests/test_litellm/proxy/proxy_server/test_routes_models.py | Adds test_get_models_anthropic_format_when_header_present that drives both /v1/models and /models with the anthropic-version header and pins the Anthropic-native shape; existing happy-path test is untouched, confirming OpenAI format is unchanged without the header. |

Reviews (3): Last reviewed commit: "fix(proxy): make model_list request para..."

Comment thread litellm/proxy/utils.py Outdated

codecov Bot commented Jun 12, 2026


Codecov Report

❌ Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| litellm/proxy/proxy_server.py | 83.33% | 1 Missing ⚠️ |

@Ar-maan05

Contributor Author

@greptile-apps

@Ar-maan05

Contributor Author

Hi, my PR is passing CI and has a 5/5 score from greptile. Please let me know if any changes are required.

@Sameerlite

Collaborator

@greptileai

@Ar-maan05

Contributor Author

Hi @Sameerlite, please let me know if any changes are needed. Always a pleasure to contribute to this repo.

@Sameerlite Sameerlite changed the base branch from litellm_internal_staging to litellm_oss_170626_1 June 17, 2026 11:23
@Sameerlite Sameerlite merged commit 4e31885 into BerriAI:litellm_oss_170626_1 Jun 17, 2026
73 checks passed
Sameerlite added a commit that referenced this pull request Jun 17, 2026
mateo-berri pushed a commit that referenced this pull request Jun 18, 2026
* fix(proxy): allow non-admin virtual keys to call GA Realtime WebRTC HTTP routes (#30089)

* fix(proxy): allow non-admin virtual keys to call GA Realtime WebRTC HTTP routes

Add the realtime WebRTC HTTP sub-routes (/realtime/client_secrets,
/realtime/calls and their /v1 + /openai/v1 variants) to
LiteLLMRoutes.openai_routes so is_llm_api_route() classifies them as
LLM API routes. Without this, non-admin virtual keys received
401 'Only proxy admin can be used to generate, delete, update info
for new keys/users/teams' when calling these endpoints.

Fixes #29923

* fix(proxy): validate session.model for realtime routes in model-access check

The GA Realtime WebRTC HTTP routes resolve the effective model from the
nested session.model (falling back to the top-level model), but the auth
layer's get_model_from_request() only extracted the top-level model. A
model-restricted virtual key could therefore place a disallowed model in
session.model, leave the top-level model unset, and skip can_key_call_model()
entirely - obtaining an ephemeral token for a model it is not allowed to use.

Extract session.model for the realtime client_secrets/calls routes so the
model-access check runs against the model the request will actually use.
Legitimate callers are unaffected; their permitted model still validates.

Relates to #29923

* fix(proxy): classify realtime transcription_sessions routes as LLM API routes

Add the GA Realtime WebRTC transcription_sessions HTTP routes to
openai_routes so is_llm_api_route() returns True for them, matching the
client_secrets and calls routes already fixed. These endpoints are
registered with user_api_key_auth in realtime_endpoints/endpoints.py, so
without this a non-admin virtual key calling
POST /v1/realtime/transcription_sessions would hit the admin-only 401
branch. Extends the regression test parametrization accordingly.

---------

Co-authored-by: habonlaci <4699494+habonlaci@users.noreply.github.com>
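
The model-resolution order described in the realtime fix above (nested session.model first, then the top-level model) can be sketched as follows. The function name and the placeholder model ids are hypothetical; this is not the body of get_model_from_request.

```python
from typing import Any, Dict, Optional


def get_realtime_model(body: Dict[str, Any]) -> Optional[str]:
    """Return the model a realtime request will actually use (sketch).

    Prefers the nested session.model and falls back to the top-level
    model, so the access check sees the same value as the handler.
    """
    session = body.get("session")
    if isinstance(session, dict):
        model = session.get("model")
        if isinstance(model, str) and model:
            return model
    top_level = body.get("model")
    if isinstance(top_level, str) and top_level:
        return top_level
    return None


# A restricted key hiding a model in session.model is now visible to the check.
print(get_realtime_model({"session": {"model": "model-b"}}))  # model-b
print(get_realtime_model({"model": "model-a"}))  # model-a
```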

* feat(proxy): surface max_input_tokens/max_output_tokens on /v1/models (#30272)

* feat(proxy): surface max_input_tokens/max_output_tokens on /v1/models

* fix(proxy): degrade /v1/models gracefully when model-group lookup fails

---------

Co-authored-by: Sameer Kankute <sameer@berri.ai>

* fix: sort tiered token-cost thresholds numerically (#30375)

* fix: sort tiered token-cost thresholds numerically

_get_token_base_cost iterated input_cost_per_token_above_<N>_tokens keys with a
lexicographic sort, so for tiers whose thresholds have different digit lengths
(e.g. 90k vs 128k) a request crossing both was billed at the lower tier that
sorted first. Sort by the parsed numeric threshold instead, so the highest tier
the request actually crosses is applied.

* refactor: reuse _parse_above_token_threshold for inline threshold parse

---------

Co-authored-by: Eric (GabiDevFamily) <271972409+santino18727-debug@users.noreply.github.com>
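
The threshold-ordering bug fixed above is easy to reproduce in isolation. A minimal sketch, assuming keys of the form input_cost_per_token_above_<N>_tokens where <N> may carry a k suffix (as in the 90k and 128k examples); parse_threshold and pick_input_cost are hypothetical names, not the library's functions, and the prices are made up.

```python
import re
from typing import Dict, List, Optional, Tuple

_KEY = re.compile(r"^input_cost_per_token_above_(\d+)(k?)_tokens$")


def parse_threshold(key: str) -> Optional[int]:
    """Parse the numeric threshold out of a tier key, or None if not a tier key."""
    match = _KEY.match(key)
    if match is None:
        return None
    value = int(match.group(1))
    return value * 1000 if match.group(2) else value


def pick_input_cost(model_info: Dict[str, float], prompt_tokens: int, base_cost: float) -> float:
    """Apply the highest tier the request actually crosses."""
    tiers: List[Tuple[int, float]] = []
    for key, cost in model_info.items():
        threshold = parse_threshold(key)
        if threshold is not None:
            tiers.append((threshold, cost))
    tiers.sort()  # numeric order on the parsed threshold, not on the key string
    chosen = base_cost
    for threshold, cost in tiers:
        if prompt_tokens > threshold:
            chosen = cost
    return chosen


info = {
    "input_cost_per_token_above_90k_tokens": 2e-6,
    "input_cost_per_token_above_128k_tokens": 4e-6,
}
# A plain string sort puts the 128k key before the 90k key.
print(sorted(info)[0])  # input_cost_per_token_above_128k_tokens
print(pick_input_cost(info, 200_000, 1e-6))  # 4e-06
```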

* fix(openai): preserve cache_control for openai-compatible custom endpoints (#30387)

* fix(openai): preserve cache_control for openai-compatible custom endpoints

* fix(openai): use parsed hostname to detect real OpenAI for cache_control preservation

* fix(proxy): drain all daily-spend batches per flush cycle (#30281) (#30505)

* fix(types): prevent internal parallel_request_limiter fields from leaking to upstream providers (#30545)

* fix(types): add internal parallel_request_limiter fields to all_litellm_params to prevent forwarding to upstream providers

* test(types): add regression test for internal rate-limit fields in all_litellm_params

* fix(init): add bool type annotation to suppress_debug_info (#30531)

Module-level `suppress_debug_info = False` had no annotation, so strict
type checkers (e.g. ty) infer it as `Literal[False]`. Reassigning it to
`True` (as done in proxy_server.py and router.py) then fails with an
invalid-assignment error. Annotate it as `bool` to match every other
flag in this module.

* fix: coalesce null aggregates in update_metrics for no-spend keys (#29945)

* feat(team_endpoints): add query parameter `key_limit` to `/team/info` endpoint (#30006)

* feat(team_endpoints): Add query parameter key_limit to /team/info

* feat(team_endpoints): update schema.d.ts to include the new query parameter

* feat(team_endpoints): add tests for limiting key count in /team/info response

* feat(team_endpoints): Apply suggestions from greptile

* Set greater-than constraint on key-limit
* Fix type

* fix(router): release aiohttp connection when stream iteration ends abnormally (#30271)

* fix(router): release aiohttp connection when stream iteration ends abnormally

A streaming response that terminates with a mid-stream read timeout, a task
cancellation (client disconnect), or GeneratorExit never closed the underlying
aiohttp ClientResponse. aiohttp only auto-releases the connector slot at body
EOF, so each abnormally terminated stream permanently leaked one slot from the
shared TCPConnector pool. During a backend traffic spike the pool drains; once
exhausted every subsequent request to that host waits for a slot, times out
and surfaces as a 408, indefinitely, even after the backend recovers. Only a
proxy restart cleared the in-memory sessions, which matched the reported
symptom of a router stuck returning 408 for a healthy vLLM backend.

Close the response in a finally clause when iteration ends. On a fully read
response the connection was already released at EOF and close() is a no-op,
so keep-alive reuse for normal requests is unchanged.

Fixes #30192

* test(aiohttp): cover GeneratorExit path with a mock instead of a live socket

The previous slot-release test started a real aiohttp TCP server, which can
flake in offline CI and does not exercise this fix's code path directly.
Replace it with a dependency-injected mock that closes the stream generator
(GeneratorExit) and asserts the response is closed, covering the third
abnormal-exit path the finally block handles.
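
The close-in-finally pattern from the router fix above can be sketched without a network. FakeResponse is a hypothetical stand-in for an aiohttp ClientResponse that exposes only what the sketch needs; the real transport code is not reproduced here.

```python
import asyncio
from typing import AsyncIterator, List


class FakeResponse:
    """Hypothetical stand-in for a streaming HTTP response."""

    def __init__(self, chunks: List[bytes]) -> None:
        self._chunks = chunks
        self.closed = False

    async def iter_chunks(self) -> AsyncIterator[bytes]:
        for chunk in self._chunks:
            await asyncio.sleep(0)  # yield control, as a real read would
            yield chunk

    def close(self) -> None:
        self.closed = True


async def stream_body(response: FakeResponse) -> AsyncIterator[bytes]:
    """Yield chunks and always close the response when iteration ends."""
    try:
        async for chunk in response.iter_chunks():
            yield chunk
    finally:
        # Runs on normal EOF, on exceptions such as timeouts or
        # cancellation, and on GeneratorExit when the consumer stops early.
        response.close()


async def demo() -> bool:
    response = FakeResponse([b"a", b"b", b"c"])
    stream = stream_body(response)
    await stream.__anext__()  # read one chunk, then abandon the stream
    await stream.aclose()  # raises GeneratorExit inside stream_body
    return response.closed


print(asyncio.run(demo()))  # True
```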

* feat(proxy): serve Anthropic-native /v1/models for Claude Code gateway discovery (#30273)

* feat(proxy): serve Anthropic-native /v1/models for Claude Code gateway discovery

* refactor(proxy): move Anthropic model-list formatter into llms/anthropic/common_utils

* fix(proxy): make model_list request param optional for direct callers

* feat(dashscope): add Responses API support (#30286)

* feat(dashscope): add Responses API support

DashScope's OpenAI-compatible endpoint serves /responses, so register a
DashScopeResponsesAPIConfig that routes dashscope/* responses calls to
{api_base}/responses without rewriting the upstream model id, instead of
falling back to the chat-completions -> responses emulation pipeline.

Closes #29780

* feat(dashscope): mark responses API as not supporting native websocket

Matches the hosted_vllm/perplexity/openrouter responses configs, which all
override supports_native_websocket() to False since the OpenAI-compatible
endpoint has no native wss:// responses transport.

---------

Co-authored-by: Sameer Kankute <sameer@berri.ai>

* fix(spend-logs): preserve error_message on ProxyException failures (#30381)

* fix(spend-logs): preserve error_message on ProxyException failures

`StandardLoggingPayloadSetup.get_error_information` used
`str(original_exception)` to populate the human-readable error message
stored in `spend_logs.metadata.error_information.error_message`.

`ProxyException` (litellm/proxy/_types.py:3453) sets `self.message` in
its constructor but does NOT call `super().__init__(message)` and does
NOT define `__str__`. As a result, `str(ProxyException(...))` returns
the empty string, and every auth/budget/quota rejection was landing
in spend_logs with `error_message=""` despite a fully populated
traceback.

Operator impact: dashboard "LLM Failure" rows became untriageable —
the only way to tell a 401 from a 429 was to manually unpack the
traceback JSON via psql. Burst failure patterns (e.g. a UI session
polling with a stale token) produced 20-30 indistinguishable
`error_code=401` rows per second.

Fix: prefer the `.message` attribute (set by ProxyException and every
litellm.exceptions.* class) over `str(exc)`. The `str(exc)` fallback
is retained for non-litellm exception types, preserving prior behavior.

Test plan:
  - 2 new unit tests in tests/test_litellm/litellm_core_utils/
    test_litellm_logging.py:
    * test_get_error_information_prefers_message_attribute_over_str
    * test_get_error_information_falls_back_to_str_when_no_message_attr
  - Existing test_get_error_information_error_code_priority still passes
  - End-to-end verified: bad-key 401 now stores full
    "Authentication Error, Invalid proxy server token passed..."
    message in spend_logs.metadata.error_information.error_message

* fix(spend-logs): preserve explicit empty .message + drop dead reference

Greptile P2 on #30381. The truthiness check `if message_attr:`
silently skipped an explicit empty-string `.message` and fell
through to `str(original_exception)`. For ProxyException-shaped
objects both produce empty, so the bug was latent; for other
exception types it would inject a different string into
error_information.error_message and corrupt the signal.

Use `is not None` so an empty string survives verbatim.

Also drop the stale `See e2e/cases/11.` comment reference — that
path does not exist anywhere in the repo and confuses future
readers.

Regression test added: an exception with `.message=""` and a
non-empty `super().__init__()` arg must yield error_message == "".

* ci: retrigger workflows after base branch change to litellm_internal_staging
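
The message-selection rule from the spend-logs fix above can be sketched as follows. ProxyLikeError is a hypothetical class that mimics the described behaviour (it sets .message, is built with keyword arguments, and never calls super().__init__), and get_error_message is an illustrative name rather than the library method.

```python
class ProxyLikeError(Exception):
    """Hypothetical exception that sets .message but skips super().__init__()."""

    def __init__(self, *, message: str, code: int) -> None:
        self.message = message
        self.code = code


def get_error_message(exc: Exception) -> str:
    """Prefer an explicit .message attribute, falling back to str(exc)."""
    message_attr = getattr(exc, "message", None)
    # `is not None` keeps an explicit empty string instead of falling through.
    if message_attr is not None:
        return str(message_attr)
    return str(exc)


err = ProxyLikeError(message="Authentication Error", code=401)
print(repr(str(err)))  # '' because no positional args reached BaseException
print(get_error_message(err))  # Authentication Error
print(get_error_message(ValueError("boom")))  # boom
```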

* fix(anthropic): strip LiteLLM-injected total_tokens from /v1/messages response (#30382)

* fix(anthropic): strip LiteLLM-injected total_tokens from /v1/messages response

The non-streaming /v1/messages response carries a LiteLLM-injected
usage.total_tokens = input_tokens + output_tokens that is not part of
the Anthropic API spec. This caused three problems:

1. Shape divergence with streaming on the same endpoint.
   message_delta.usage in the SSE path never carries total_tokens.
   Clients parsing both paths get two different schemas from one endpoint.

2. Shape divergence with upstream. Direct calls to
   https://api.anthropic.com/v1/messages return no total_tokens field,
   so clients using the official Anthropic SDK couldn't rely on it,
   and clients that did rely on the LiteLLM-injected one broke when
   bypassing the proxy.

3. Numerical misuse. total = input + output undercounts when
   cache_read_input_tokens and cache_creation_input_tokens are
   non-zero, because cache tokens are reported in their own fields.
   A 100k-token cached prompt with 1 non-cache input token + 200
   output tokens reports total_tokens = 201, off by ~99.8% from any
   reasonable definition of "total."

Fix: add _strip_total_tokens_from_anthropic_response in
litellm/proxy/anthropic_endpoints/endpoints.py and invoke it in the
success path of anthropic_response right before returning. Only mutates
dict-shaped responses; streaming (which already lacks the field) is
left untouched.

spend_logs / Prometheus continue to compute total_tokens internally
for billing — this fix only strips the field from the wire response.

Scope: only the Anthropic passthrough endpoint /v1/messages. The
OpenAI-shape /v1/chat/completions is unaffected.

* fix(anthropic): gate total_tokens strip behind flag + handle Pydantic .usage

Two P1 greptile threads on #30382:

P1 — **Backwards-incompatible removal without a feature flag**
  Stripping `usage.total_tokens` unconditionally breaks any client
  currently reading the LiteLLM-shaped non-streaming /v1/messages
  response. Per the codebase's policy (mirrors #30418), gate behind
  a new flag.

  - `litellm.strip_anthropic_total_tokens: bool = False` (default —
    backward-compat: clients keep seeing total_tokens).
  - Env override: `LITELLM_STRIP_ANTHROPIC_TOTAL_TOKENS=true`.
  - Docstring: planned to flip to True in a future major release;
    opt in early.

P1 — **Silent no-op if `result` is a Pydantic model**
  `base_process_llm_request` may return a Pydantic-style object
  whose `.usage` is a plain dict (the most common shape — e.g.
  objects wrapping raw upstream JSON). The original
  `isinstance(response, dict)` guard skipped strip on those, so
  `total_tokens` would still hit the wire. Helper now also reads
  `getattr(response, "usage", None)` and strips when that's a dict.

  Strongly-typed Pydantic `Usage` sub-models with required
  `total_tokens` fields are still skipped — those impose type
  constraints the helper doesn't try to subvert.

Tests:
- `test_strips_total_tokens_on_pydantic_model_with_dict_usage`
- `test_flag_defaults_off`
8/8 pass locally.

* fix(anthropic): drop env var for strip flag (docs CI)

Mirrors #30418's pattern (`expose_router_debug_in_errors: bool = True`,
no `os.getenv`). The `LITELLM_STRIP_ANTHROPIC_TOTAL_TOKENS` env var
introduced in the prior commit was flagged by
`tests/documentation_tests/test_env_keys.py` because the documentation
file `docs/my-website/docs/proxy/config_settings.md` lives in
`BerriAI/litellm-docs` (separate repo) and registering a new env key
requires a parallel docs PR — a friction we avoid here by exposing
the flag only as a Python attribute + `litellm_settings` config key,
both of which load through the existing proxy config plumbing without
needing the env-var registry to be updated.

No semantic change: default still False, behavior identical when set
via `litellm.strip_anthropic_total_tokens = True` or
`litellm_settings.strip_anthropic_total_tokens: true` in config.yaml.

Verified locally: env scan no longer surfaces the key; 8/8 tests pass.

* ci: retrigger workflows after base branch change to litellm_internal_staging
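
The flag-gated strip described in the total_tokens fix above might look roughly like this. A sketch only: strip_total_tokens and the enabled parameter are hypothetical stand-ins for the helper and the litellm.strip_anthropic_total_tokens flag, and WrappedResponse imitates an object whose .usage is a plain dict.

```python
from typing import Any


class WrappedResponse:
    """Hypothetical object wrapping raw upstream JSON with a dict .usage."""

    def __init__(self, usage: dict) -> None:
        self.usage = usage


def strip_total_tokens(response: Any, enabled: bool = False) -> Any:
    """Remove usage.total_tokens when the opt-in flag is set (sketch)."""
    if not enabled:
        return response  # default: clients keep seeing total_tokens
    if isinstance(response, dict):
        usage = response.get("usage")
    else:
        usage = getattr(response, "usage", None)
    if isinstance(usage, dict):
        usage.pop("total_tokens", None)
    return response


body = {"usage": {"input_tokens": 1, "output_tokens": 200, "total_tokens": 201}}
print(strip_total_tokens(body, enabled=True)["usage"])
# {'input_tokens': 1, 'output_tokens': 200}
```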

* fix(pricing): correct swapped input/output token costs for command-r7b-12-2024 (#30413)

* fix(pricing): correct swapped input/output token costs for command-r7b-12-2024

* test: resolve model prices JSON relative to test file for pip installs

* fix(exception-mapping): map Gemini upstream-error body code 429 to RateLimitError (#30417)

* fix(exception-mapping): map Gemini upstream-error body code 429 to RateLimitError

Some Gemini-compatible gateways (e.g. new-api) wrap a 429 rate-limit
signal from upstream inside an HTTP 500/503 envelope, with the real
code only surfaced in the JSON body:

    {"error":{"message":"...high demand...","type":"upstream_error",
              "param":"","code":429}}

Previously LiteLLM only looked at the HTTP status and mapped this to
InternalServerError, which Router treats as non-retryable for many
configs — so users got hard 500s instead of fallback/retry.

Now the Gemini/Vertex exception mapper parses error.code from the body
and routes code 429 to RateLimitError before falling through to the
HTTP-status branches. Other body codes fall through unchanged.

Tests cover:
- new-api gateway's `code:429` payload now maps to RateLimitError
- Genuine 500-body responses stay InternalServerError
- Non-JSON body strings fall through to status-code mapping unchanged

* fix(exception-mapping): scope body-code 429 promotion to 5xx envelopes

Addresses greptile P1/P2 + @Sameerlite's review on #30417. The new
elif branch was firing for any HTTP status, so a gateway response of
HTTP 400 with body {"error":{"code":429,...}} would be incorrectly
promoted to RateLimitError (retryable) instead of falling through
to BadRequestError. Same trap for 401 -> AuthenticationError.

Scoped the body-code 429 check to `500 <= status_code < 600` —
covers 500/502/503/504 (gateways wrapping upstream 429 in any 5xx
envelope) without inviting the 4xx misclassification.

Tests: parametrized table now covers 5xx (500/502/503), 4xx (400/401),
and the existing fall-through cases, asserting each maps to the
exception type that matches the HTTP status code. 50/50 pass locally.

* ci: retrigger workflows after base branch change to litellm_internal_staging
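
The scoping rule from the exception-mapping fix above can be sketched as a small classifier. It returns exception names as strings so the sketch stays self-contained; classify_gateway_error is a hypothetical name, and the status-to-name table is a simplified illustration rather than the library's full mapping.

```python
import json
from typing import Optional


def body_error_code(body: str) -> Optional[int]:
    """Extract error.code from a JSON body, or None if it is absent or not JSON."""
    try:
        parsed = json.loads(body)
    except (ValueError, TypeError):
        return None
    error = parsed.get("error") if isinstance(parsed, dict) else None
    code = error.get("code") if isinstance(error, dict) else None
    return code if isinstance(code, int) else None


def classify_gateway_error(status_code: int, body: str) -> str:
    """Promote a body-level 429 to a rate-limit error only inside a 5xx envelope."""
    if 500 <= status_code < 600 and body_error_code(body) == 429:
        return "RateLimitError"
    if status_code == 429:
        return "RateLimitError"
    if status_code == 400:
        return "BadRequestError"
    if status_code == 401:
        return "AuthenticationError"
    if 500 <= status_code < 600:
        return "InternalServerError"
    return "APIError"


wrapped = '{"error":{"message":"high demand","type":"upstream_error","param":"","code":429}}'
print(classify_gateway_error(503, wrapped))  # RateLimitError
print(classify_gateway_error(400, wrapped))  # BadRequestError
```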

* feat(router): add expose_router_debug_in_errors flag (default True) to redact internal model_group/fallback names (#30418)

* feat(router)!: redact internal model_group/fallback names from exception messages

The Router was unconditionally appending internal config names onto
exception.message:
  - "Received Model Group=..."
  - "Available Model Group Fallbacks=..."
  - "No fallback model group found... Fallbacks={...}"
  - "context_window_fallbacks={...}"
  - Deployment-timeout messages including model_group
  - Fallback failure detail listing fallback chain

ProxyException forwards .message verbatim to clients, so gateways were
leaking their model_name / fallback wiring in every failed call.

Fix: gate all five mutation sites on a new
`litellm.expose_router_debug_in_errors` flag (default False). Set to
True to restore upstream debug behavior for local debugging.

Why: matches the redaction posture this codebase already has for
upstream model identifiers (cf. _litellm_returned_model_name) and
removes the last common error-path leak of internal model_group names.

Breaking change marker (!): if anything parses "Received Model Group="
out of client error messages, flip the flag on or migrate to the
x-litellm-* response headers instead.

Tests: 7 cases covering each of the 5 redaction sites + the flag-on
inverse path, plus a "default off" sanity check.

* test(router): cover sites 1 + 3 of expose_router_debug_in_errors gate

Addresses Greptile / codecov feedback on #30418: patch coverage was
55.6% with 4 lines uncovered in litellm/router.py. The existing tests
exercised sites 2 (ContextWindowExceededError), 4 (no-fallback-found),
and 5 (Received Model Group) — both default and flag-on. Sites 1 and 3
were declared in the PR description as covered by "site 5 also fires"
but the gate body lines for each (the `e.message +=` inside the
`if litellm.expose_router_debug_in_errors:` branch) only execute when
the flag is on AND the specific exception path is taken, which neither
existing test triggered.

Added 4 new tests (default + flag-on × 2 sites):

  - test_default_does_not_leak_deployment_timeout_debug
  - test_flag_on_leaks_deployment_timeout_debug
  - test_default_does_not_leak_content_policy_fallback_hint
  - test_flag_on_leaks_content_policy_fallback_hint

Trigger details:

  - Site 1 (litellm.Timeout in _acompletion) is reached via the
    Router-supported `mock_timeout=True` + `timeout=0.001` kwargs on
    `acompletion(...)`. Cannot embed a Timeout instance in model_list
    because Router.__init__ deep-copies it and Timeout.__reduce__ does
    not preserve the required positional args.
  - Site 3 (ContentPolicyViolationError without content_policy_fallbacks
    set, in async_function_with_fallbacks_common_utils) is reached by
    passing a `mock_response=litellm.ContentPolicyViolationError(...)`
    instance via the call-site kwarg — same deepcopy-avoidance reason.

11/11 tests pass locally. Patch coverage on litellm/router.py for this
PR's diff should now be 100%.

* chore(router): flip expose_router_debug_in_errors default to True

Addresses @Sameerlite's review on #30418 — maintain backward
compat on the wire. Redact becomes opt-in via setting the flag
to False; the historical behavior (leak internal model_group /
fallback wiring through exception messages) is preserved as the
default.

- litellm/__init__.py: default flipped to True, docstring rewritten
  with deprecation note pointing at a future flip to False (redact
  by default) in a major release.
- tests/test_litellm/test_router_exception_redaction.py: fixture
  resets to True (was False); the "off" tests now explicitly set
  False; the "default_leaks_*" tests rely on the fixture default.
  test_flag_defaults_off -> test_flag_defaults_on.
- No router.py change needed; the gate keys off the same flag,
  only the default changes.
- PR title no longer needs the breaking-change `!` marker — no
  client sees a behavior change at default settings.

11/11 pass locally.

* ci: retrigger workflows after base branch change to litellm_internal_staging

* feat(guardrails): integrate Repelloai Argus guardrail (#30465)

* feat(guardrails): add RepelloAI Argus guardrail integration (#1)

* feat(guardrails): add RepelloAI Argus guardrail integration

Add a new guardrail hook backed by RepelloAI Argus, with dashboard-managed
asset policies enforced via an asset_id and X-API-Key auth.

* fix(guardrails): harden RepelloAI Argus guardrail

- scan streaming responses on output (was bypassing the guardrail)
- log blocked verdicts as guardrail_intervened instead of success
- treat auth/config errors (401/403/404/422) as misconfiguration that
  always blocks, not a fail-open-able unreachable error
- default unreachable_fallback to fail_closed and read it directly;
  block on unknown/malformed verdicts so an API change can't silently
  disable enforcement
- type unreachable_fallback as a Literal, drop the duplicate config model,
  expose unreachable_fallback in the config schema, and stop leaking the
  raw provider response / exception strings to the client

* fix(guardrails): address RepelloAI Argus review feedback

- support ARGUS_API_KEY (with REPELLOAI_API_KEY fallback)
- make asset_id required in the config model
- normalize unreachable_fallback so only fail_open opens; block on 400 misconfig
- correct the shared unreachable_fallback field description

* docs(guardrails): add RepelloAI Argus docs page and dashboard listing

- add docs page covering config, env vars, modes, verdicts, failure semantics
- list RepelloAI Argus in the Guardrail Garden with provider/logo mappings
- add a regression test for the provider logo and display-name resolution

* fix(guardrails): keep RepelloAI asset_id optional in config model

A required asset_id leaked onto the shared LitellmParams (which inherits
RepelloAIGuardrailConfigModel), breaking validation for every other
guardrail. Keep it optional like sibling models; the guardrail __init__
still raises when asset_id is missing, which is the real enforcement.

* Add comment for last user turn scanning

* feat(guardrails): harden repelloai scanning

* feat(guardrails): expand repelloai scanning to include tool definitions

Add extraction of tool definitions and tool call arguments to the RepelloAI
guardrail scanning. Improves detection coverage by including function schemas
and parameters in the prompt sent to the guardrail service. Also captures
detailed error responses in logs and adds guardrail header to streaming responses.

* refactor(guardrails): fix and harden repelloai schema text extraction

- Fix duplicate text in _iter_schema_text: previously all dict values were
  re-queued onto the stack even after scalar/list keys were already extracted
  explicitly, causing names/descriptions to appear twice in the scanned prompt
- Extract schema key frozensets to module-level constants so they are not
  reconstructed on every call
- Change _iter_schema_text from @classmethod to @staticmethod (cls unused)
- Narrow _call_analyze stage param from str to Literal["prompt", "response"]
- Add HttpxResponse type annotation to _raise_for_config_error
- Add LLMResponseTypes annotation to async_post_call_success_hook response param
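
One way to picture the duplicate-text fix is a stack-based walk that yields explicitly handled keys once and re-queues only the remaining nested containers. This is a hypothetical sketch: the key sets and the function below are illustrative and are not the guardrail's real `_iter_schema_text`.

```python
from typing import Any, Iterator

# Module-level constants, so the sets are not rebuilt on every call.
# The specific keys are assumptions for this example.
_TEXT_KEYS = frozenset({"name", "description", "title"})
_LIST_TEXT_KEYS = frozenset({"enum"})


def iter_schema_text(schema: Any) -> Iterator[str]:
    """Yield human-readable strings from a JSON-schema-like structure, once each."""
    stack = [schema]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            for key, value in node.items():
                if key in _TEXT_KEYS and isinstance(value, str):
                    yield value  # extracted explicitly, so not re-queued
                elif key in _LIST_TEXT_KEYS and isinstance(value, list):
                    yield from (item for item in value if isinstance(item, str))
                elif isinstance(value, (dict, list)):
                    stack.append(value)  # only unhandled containers are walked
        elif isinstance(node, list):
            stack.extend(node)
```

Because handled values are never pushed back onto the stack, a tool's name and description appear exactly once in the text that gets scanned.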

* fix(guardrails): resolve pyright type errors in repelloai guardrail

- Narrow async_handler.post return from Response|None to Response with
  explicit None guard before calling raise_for_status/json
- Fix list comprehension returning str|None by switching to explicit loop
  with isinstance guard so pyright tracks the narrowing
- Cast model_dump() result to Dict since hasattr does not narrow object
  type in pyright

* fix(guardrails/repello): include Responses API instructions field in prompt scan

The /v1/responses top-level `instructions` field was not included in
_extract_prompt_text, allowing a caller to bypass guardrail policy checks
by putting blocked content in `instructions` while keeping `input` benign.

* feat: add api_key to config model and read prompt from data dict

* fix(guardrails/repello): plug input_text and tool-call response bypass gaps

Responses API input content parts with type 'input_text' were silently
dropped by build_inspection_messages (which only handles type='text'),
allowing callers to send blocked content via that path without triggering
the pre-call scan. Fix: add _extract_input_text_parts to RepelloAIGuardrail
and call it when walking the Responses API input messages.

Post-call scanning skipped responses whose choices contained only tool_calls
or function_call (message.content=None), letting models put blocked output in
function arguments undetected. Fix: _extract_chat_completion_text now calls
_extract_tool_call_args_from_message on each choice message.

Also replace typing.Dict/List with builtin dict/list to clear TID251 strict
ruff violations introduced by this file.
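
The `input_text` gap described above can be illustrated with a small extractor. This is a sketch under assumptions: the function name mirrors the commit message, but the body is a guess at the general shape and is not the code that was merged.

```python
from typing import Any


def extract_input_text_parts(input_items: Any) -> list[str]:
    """Collect text from Responses API content parts of type 'input_text'.

    A helper that only recognises type == 'text' would skip these parts,
    which is the bypass the commit message describes.
    """
    texts: list[str] = []
    if not isinstance(input_items, list):
        return texts
    for item in input_items:
        content = item.get("content") if isinstance(item, dict) else None
        if not isinstance(content, list):
            continue
        for part in content:
            if (
                isinstance(part, dict)
                and part.get("type") == "input_text"
                and isinstance(part.get("text"), str)
            ):
                texts.append(part["text"])
    return texts
```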

* fix(guardrails/repello): scan Responses API function_call output arguments

Output items with type 'function_call' in a /v1/responses response were
skipped by _extract_responses_api_text; only 'message' items were walked.
A model could return blocked content in function_call.arguments undetected.
Now extract arguments from function_call output items before scanning.

* fix(anthropic): drop orphaned server_tool_use on multi-turn replay from generic OpenAI clients (#30486)

* fix(anthropic): drop orphaned server_tool_use on multi-turn replay from generic OpenAI clients

When an Anthropic server-side tool (web_search, id `srvtoolu_...`) is used, its
result is carried in `provider_specific_fields.web_search_results` — PRs #17746
/ #17798 restore it for callers that round-trip provider_specific_fields. A
generic OpenAI client that does NOT preserve provider_specific_fields (e.g. Open
WebUI talking to a Vertex/Anthropic model over /chat/completions) drops it on
replay and instead sends back an assistant `tool_call` + a `tool` message both
keyed to the `srvtoolu_` id. The transform then produced a bare `server_tool_use`
(with no following *_tool_result) plus a user `tool_result` for the same id —
both invalid, so the next turn 400s:

  messages.N.content.0: unexpected `tool_use_id` found in `tool_result` blocks:
  srvtoolu_... Each `tool_result` block must have a corresponding `tool_use`
  block in the previous message.

This is the commonly-reported vertex_ai symptom where Gemini works but Claude
400s on the 2nd turn of a web-search chat.

Fix (litellm/litellm_core_utils/prompt_templates/factory.py):
- convert_to_anthropic_tool_invoke: only emit a server_tool_use when its matching
  *_tool_result is available to pair with it; otherwise skip it (a bare
  server_tool_use is itself rejected).
- anthropic_messages_pt: drop a replayed `tool`/`function` message whose
  tool_call_id starts with `srvtoolu_` (a server-executed tool produces no client
  result; a user tool_result for it is invalid).

The existing reconstruction path (provider_specific_fields present, e.g. the
litellm SDK) is unchanged, as is regular client tool_use/tool_result.
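
As a rough illustration of the two rules above, the sketch below applies them directly to OpenAI-format messages. The real change lives inside the Anthropic transform in factory.py; the function here is a hypothetical, simplified stand-in, and checking `provider_specific_fields.web_search_results` as the "result is available" signal is an assumption.

```python
from typing import Any

SERVER_TOOL_PREFIX = "srvtoolu_"

Message = dict[str, Any]


def drop_orphaned_server_tool_messages(messages: list[Message]) -> list[Message]:
    """Remove replayed server-tool calls and results that cannot be paired."""
    cleaned: list[Message] = []
    for message in messages:
        role = message.get("role")
        tool_call_id = str(message.get("tool_call_id", ""))
        # Rule 2: a server-executed tool produces no client-side result.
        if role in ("tool", "function") and tool_call_id.startswith(SERVER_TOOL_PREFIX):
            continue
        if role == "assistant" and message.get("tool_calls"):
            fields = message.get("provider_specific_fields") or {}
            has_results = bool(fields.get("web_search_results"))
            # Rule 1: keep a server tool call only when its result can be paired.
            kept = [
                call
                for call in message["tool_calls"]
                if has_results
                or not str(call.get("id", "")).startswith(SERVER_TOOL_PREFIX)
            ]
            message = {**message, "tool_calls": kept}
        cleaned.append(message)
    return cleaned
```

Regular client tool calls and their `tool` results pass through untouched, which matches the "unchanged" paths noted above.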

Tests (tests/llm_translation/test_prompt_factory.py):
- update test_convert_to_anthropic_tool_invoke_server_tool ->
  test_convert_to_anthropic_tool_invoke_server_tool_without_result_is_dropped
- add test_anthropic_messages_pt_generic_client_drops_orphan_server_tool

Follow-up to #17746 / #17798; addresses the generic-client (no
provider_specific_fields) case of #17737.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(anthropic): cover the srvtoolu_ round-trip fix in the test_litellm unit suite

The regression tests added in tests/llm_translation/test_prompt_factory.py aren't
run by the coverage CI job (it runs tests/test_litellm), so the new factory.py
branches showed as uncovered (codecov patch coverage). Add equivalent focused
tests in the unit suite so both new branches are exercised there:
- convert_to_anthropic_tool_invoke drops a srvtoolu_ server_tool_use when no
  matching *_tool_result is available.
- anthropic_messages_pt drops the orphaned srvtoolu_ tool message a generic
  OpenAI client replays.

Refs #17737

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(anthropic): cover the server_tool_use + result valid-pair path in unit suite

Covers the remaining patch-coverage lines codecov flagged: convert_to_anthropic_tool_invoke
emitting server_tool_use followed by its web_search_tool_result when the matching
result is present (the litellm-SDK round-trip path). Refs #17737

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style(anthropic): flatten srvtoolu_ tool-message guard to a negated if

Addresses the Greptile style nit: replace the if-pass/else with a single negated
`if not (...)` guard around the tool_result append. Behavior unchanged. Refs #17737

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(proxy): require premium only when enabling premium metadata fields (#30285) (#30506)

Co-authored-by: Sameer Kankute <sameer@berri.ai>

* fix(perplexity): stop double-billing reasoning tokens in manual cost fallback (#30488)

* fix(perplexity): stop double-billing reasoning tokens in manual cost fallback

When perplexity_cost_per_token cannot use the API-provided usage.cost.total_cost short-circuit and falls back to manual calculation, it multiplies the full usage.completion_tokens by output_cost_per_token and then adds reasoning_tokens * output_cost_per_reasoning_token on top. Per the OpenAI/Perplexity usage convention codified for the central path in PR #18607, completion_tokens already INCLUDES reasoning_tokens, so the manual fallback double-bills reasoning at both the output and reasoning rate.

Concrete impact on perplexity/sonar-deep-research (input 2e-6, output 8e-6, reasoning 3e-6): for the exact usage shape exercised by the live response fixture in tests/llm_translation/test_perplexity_reasoning.py (prompt_tokens=9, completion_tokens=20, reasoning_tokens=15), the current code charges 0.000223 instead of the convention-correct 0.000103, about 2.165x the correct amount. The bug is reachable whenever Perplexity omits the cost object (streaming chunks, fixture-driven paths, older API versions).

Subtracts reasoning_tokens (clamped at zero) from completion_tokens before applying the output rate, mirroring how dashscope/cost_calculator.py and the central generic_cost_per_token already handle it. Preserves the existing fallback behaviour when output_cost_per_reasoning_token is unset (all completion_tokens stay at the output rate).

Existing tests in tests/test_litellm/llms/perplexity/test_perplexity_cost_calculator.py asserted the buggy math and are updated to the convention-correct math. Adds a focused regression test using the exact usage shape from the live response fixture so this class of bug cannot be silently reintroduced.
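
The arithmetic above can be reproduced with a short worked example. The function names are hypothetical and this is not the code in perplexity_cost_per_token; it only mirrors the before/after math described in the commit message, using the rates and token counts quoted there.

```python
from typing import Optional


def buggy_completion_cost(
    completion_tokens: int, reasoning_tokens: int, output_rate: float, reasoning_rate: float
) -> float:
    """Old fallback: bills all completion tokens, then reasoning tokens again."""
    return completion_tokens * output_rate + reasoning_tokens * reasoning_rate


def fixed_completion_cost(
    completion_tokens: int,
    reasoning_tokens: int,
    output_rate: float,
    reasoning_rate: Optional[float] = None,
) -> float:
    """New fallback: completion_tokens already includes reasoning_tokens."""
    if reasoning_rate is None or reasoning_tokens <= 0:
        # No reasoning rate configured: everything stays at the output rate.
        return completion_tokens * output_rate
    text_tokens = max(completion_tokens - reasoning_tokens, 0)  # clamp at zero
    return text_tokens * output_rate + reasoning_tokens * reasoning_rate


INPUT_RATE, OUTPUT_RATE, REASONING_RATE = 2e-6, 8e-6, 3e-6
prompt_cost = 9 * INPUT_RATE
buggy_total = prompt_cost + buggy_completion_cost(20, 15, OUTPUT_RATE, REASONING_RATE)
fixed_total = prompt_cost + fixed_completion_cost(20, 15, OUTPUT_RATE, REASONING_RATE)

print(f"{buggy_total:.6f} {fixed_total:.6f} {buggy_total / fixed_total:.3f}")
# prints "0.000223 0.000103 2.165"
```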

* style(perplexity): drop redundant type annotation on else branch to satisfy mypy

mypy [no-redef] flagged 'completion_cost' as declared in both if and else arms; keeping the annotation only on the first declaration matches existing patterns in this file.

* fix(perplexity): update integration test expected costs for non-double-billed math

Three tests in test_perplexity_integration.py asserted the old buggy expectation
that reasoning_tokens are billed in addition to the full completion_tokens
count. After the fix in cost_per_token, reasoning_tokens are billed at the
reasoning rate and the remaining (completion_tokens - reasoning_tokens) at the
standard output rate, matching OpenAI/Perplexity convention (PR #18607).

Updates: test_end_to_end_cost_calculation_with_transformation,
test_main_cost_calculator_integration, test_high_volume_cost_calculation.
The high-volume sanity threshold drops to 0.25 to reflect the corrected total.

* fix(ui): use dynamic proxy base URL in MCP usage examples (#30487)

Replace hardcoded http://localhost:4000 with getProxyBaseUrl() in the
MCP server usage example and copy-to-clipboard snippet so the generated
configuration works for non-local deployments.

Fixes #30466

* feat: add missing UK PII entity types to Presidio guardrail (#30537)

* feat: add missing UK PII entity types to Presidio guardrail

Add UK_PASSPORT, UK_POSTCODE, and UK_VEHICLE_REGISTRATION to PiiEntityType enum and PII_ENTITY_CATEGORIES_MAP. These entity types are supported by Microsoft Presidio but were missing from litellm's type definitions, preventing users from configuring UK-specific PII detection.

* test: remove fragile hardcoded entity count test

Remove test_uk_category_entity_count which hardcodes len() == 5. The test_uk_entities_match_presidio_recognizers test already verifies exact set equality, making the count test redundant and fragile to future Presidio additions.

* style: apply Black formatting to match CI requirements

* fix: route volcengine (Doubao) tiered-pricing models to the tiered cost handler (#30357)

Volcengine (Doubao) models define `tiered_pricing` but no flat per-token cost, so cost_per_token fell through to generic_cost_per_token (which only reads flat costs) and tracked them at $0.

Route custom_llm_provider == "volcengine" to the shared tiered-pricing handler in litellm/llms/dashscope/cost_calculator.py, which already computes graduated tier costs. Make that handler provider-agnostic by adding a custom_llm_provider argument (default "dashscope" preserves existing behavior) so get_model_info resolves the correct model map entry.

Fixes #30346

* feat(mcp): make MCP gateway name and description configurable via env vars (#30473)

* feat(mcp): make MCP gateway name and description configurable via env vars

* Rename function _restore_env to _apply_env

* docs(mcp): document import-time capture of env-backed identity constants

Address Greptile review feedback: clarify that LITELLM_MCP_SERVER_NAME and
LITELLM_MCP_SERVER_DESCRIPTION are read once at import and require a module
reload to observe env changes after import.

Generated with AI assistance

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Yevhen Luhovtsov <yevhen.luhovtsov@intapp.com>
Co-authored-by: Claude <noreply@anthropic.com>

* fix(mcp): preserve native tools in semantic filter hook (#26650)

* fix(mcp): preserve native tools in semantic filter hook

The SemanticToolFilterHook.async_pre_call_hook passed ALL tools (MCP +
native) to filter_tools(), which only knows MCP-registered tool names.
Native tools silently failed the name match in _get_tools_by_names()
and were dropped from the request.

Fix: partition tools into native and MCP-registered before filtering.
Run the semantic filter only on MCP tools, then merge native tools
back unconditionally.

Changes:
- Robust _is_mcp_tool() using shape-based detection for OpenAI-format
  dicts, safe regardless of future _extract_tool_info changes
- Single-pass partition loop (no double _is_mcp_tool calls)
- Preserve native tools in MCP expansion path (mixed requests)
- Track MCP expansion to prevent expanded tools bypassing filtering
- filter_stats reports MCP-only counts for accurate metrics
- Extracted _emit_filter_metadata() helper
- Skip spurious filter headers for all-native tool requests

Closes #26212

* remove stale docstring note referencing tools_expanded_from_mcp

* fix: handle Responses API name collision and preserve tool ordering

- Classify Responses API tools ({type: 'function', name: '...'}) as
  native to prevent name collisions with MCP canonical names
- Preserve original request tool ordering using id()-based merge
  instead of naive native+mcp concatenation
- Add 2 regression tests: name collision and ordering preservation
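
The partition-then-merge approach described in these commits might look roughly like the sketch below. Everything here is illustrative: `is_mcp_tool` and `semantic_filter` are hypothetical callables standing in for the hook's internals, and the id()-based merge assumes the filter returns the same tool objects it was given.

```python
from typing import Any, Callable

Tool = dict[str, Any]


def filter_preserving_native(
    tools: list[Tool],
    is_mcp_tool: Callable[[Tool], bool],
    semantic_filter: Callable[[list[Tool]], list[Tool]],
) -> list[Tool]:
    """Run the semantic filter on MCP tools only and keep every native tool."""
    mcp_tools: list[Tool] = []
    native_ids: set[int] = set()
    for tool in tools:  # single pass, one classification per tool
        if is_mcp_tool(tool):
            mcp_tools.append(tool)
        else:
            native_ids.add(id(tool))

    # Skip the filter entirely when the request has no MCP tools.
    kept_ids = {id(tool) for tool in semantic_filter(mcp_tools)} if mcp_tools else set()

    # Merge by walking the original list, so request ordering is preserved.
    return [tool for tool in tools if id(tool) in native_ids or id(tool) in kept_ids]
```

An identity-based merge breaks if the filter returns copies of the tools, which is one plausible reason to prefer matching on names instead.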

* style: apply black formatting

* fix(mcp): harden semantic filter — preserve all native tool formats, safe metadata access, graceful expansion failure, name-based merge

* lint: suppress PLR0915 on async_pre_call_hook (matches codebase convention)

* ci: retrigger checks after rebase onto litellm_internal_staging

* feat(fireworks): sync Fireworks AI model registry with current platform catalog (#30616)

Adds 12 new Fireworks serverless models and updates 3 existing entries in
model_prices_and_context_window.json and its bundled backup to match the
current Fireworks platform model list. New direct models: glm-5p2,
qwen3p7-plus, minimax-m3, minimax-m2p7, kimi-k2p7-code, kimi-k2p6,
deepseek-v4-pro, deepseek-v4-flash. New router endpoints: glm-5p1-fast,
kimi-k2p6-fast, kimi-k2p7-code-fast. Updated: glm-5p1, gpt-oss-120b, and
gpt-oss-20b now carry correct output token caps, cache-read pricing, and
explicit capability flags.

max_tokens is set equal to max_output_tokens (not the full context window)
for models whose generation cap is below their context window. This avoids
the shared input+output budget path in get_modified_max_tokens, which would
otherwise let callers request output sizes the model cannot produce. The
same fix corrects the pre-existing glm-5p1, gpt-oss-120b, and gpt-oss-20b
entries that had max_tokens equal to the full context window.

Short-form aliases (fireworks_ai/<model>) are added for every direct
accounts/fireworks/models/ entry so cost attribution works for callers
using bare model names. Router endpoints get short-form aliases too, and
transform_request now routes bare names ending in -fast to the
accounts/fireworks/routers/ path instead of defaulting every bare name to
models/. This keeps the kimi-k2p6-fast router from being misrouted to the
nonexistent models/kimi-k2p6-fast endpoint.
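
The bare-name routing rule above can be sketched as a small resolver. This is a simplified, hypothetical stand-in for the logic in transform_request; the function name and the pass-through for names that already start with `accounts/` are assumptions.

```python
def resolve_fireworks_model_path(model: str) -> str:
    """Map a bare Fireworks model name to its fully qualified path."""
    if model.startswith("accounts/"):
        return model  # already fully qualified, leave untouched (assumption)
    if model.endswith("-fast"):
        # Bare names ending in -fast refer to router endpoints.
        return f"accounts/fireworks/routers/{model}"
    return f"accounts/fireworks/models/{model}"
```

For example, `resolve_fireworks_model_path("kimi-k2p6-fast")` yields a `routers/` path, while `resolve_fireworks_model_path("glm-5p2")` yields a `models/` path.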

kimi-k2p6-turbo is intentionally excluded; kimi-k2p6-fast is its
replacement. Context windows for deepseek-v4 and kimi models use the
power-of-two values (1048576 and 262144) published on the Fireworks model
pages, matching the convention already used by existing entries.

Two regression tests in test_utils.py assert the exact per-token costs,
token limits, capability flags, and short-form-to-long-form equality for
all 15 models against both the main and backup cost maps. Two routing
tests in test_fireworks_ai_chat_transformation.py verify bare -fast names
route to routers/ and bare direct-model names route to models/.

* fix(bedrock): handle role:"system" inside the messages array on /v1/messages (#29698) (#30443)

* feat(anthropic): hoist leading in-array system to top-level (helper)

* test(anthropic): cover _system_content_to_blocks edge cases; deepcopy cache_control

* test(anthropic): mid-conversation system normalization cases

* feat: add supports_mid_conversation_system flag to Claude Opus 4.8

Add supports_mid_conversation_system: true to all 9 claude-opus-4-8 cost-map
entries (Anthropic-native, Bedrock, Vertex, Azure AI) in both the root cost
map and the bundled package backup, since the runtime helper and tests read
the backup in local/offline mode.

Pin the mid-system passthrough regression test to the local cost map via the
existing local_model_cost_map fixture so it reads the branch-local flag rather
than the network-fetched main copy.

* fix(bedrock): normalize in-array system in /v1/messages handler (#29698)

Wire normalize_system_messages_for_anthropic into anthropic_messages_handler
so all Bedrock /v1/messages paths (Invoke / Mantle / ClaudePlatform /
Converse-bridge) hoist leading in-array system entries (and demote
mid-conversation ones on models lacking supports_mid_conversation_system) into
the top-level system field. The normalized messages/system are written back
into the local_vars snapshot the base_llm branch reads from, otherwise the
Invoke/Mantle fix would silently no-op.

Also fix the helper to resolve supports_mid_conversation_system through the
prefix-aware AnthropicModelInfo._supports_model_capability resolver. The raw
_supports_factory could not see the flag once get_llm_provider left the
invoke/ prefix on the model id, which would have wrongly demoted
mid-conversation system on a Bedrock invoke opus-4-8 path.
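
A minimal sketch of the "hoist leading system entries" step is shown below. It is not the real normalize_system_messages_for_anthropic: the helper names, the merge order (existing top-level system first), and the dropping of empty strings are assumptions, and mid-conversation entries are deliberately left untouched here.

```python
import copy
from typing import Any, Optional, Union

Block = dict[str, Any]


def _to_blocks(content: Any) -> list[Block]:
    """Convert string or list content into Anthropic-style text blocks."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}] if content else []
    if isinstance(content, list):
        # Deep-copy so nested fields such as cache_control are not shared.
        return [copy.deepcopy(block) for block in content if isinstance(block, dict)]
    return []


def hoist_leading_system(
    messages: list[dict[str, Any]],
    system: Optional[Union[str, list[Block]]] = None,
) -> tuple[list[Block], list[dict[str, Any]]]:
    """Move leading role == 'system' messages into the top-level system blocks."""
    blocks = _to_blocks(system)
    index = 0
    while index < len(messages) and messages[index].get("role") == "system":
        blocks.extend(_to_blocks(messages[index].get("content")))
        index += 1
    return blocks, messages[index:]
```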

* fix(bedrock): resolve mid-conversation-system flag through mantle/invoke/converse route prefixes; drop unused param

* fix(types): widen system param to Union[str, List] for hoisted system blocks

* refactor(bedrock): drop dead local_vars messages writeback

* fix(bedrock/converse): translate in-array system in anthropic->openai adapter (#29698)

* fix(bedrock/converse): preserve cache_control on in-array system; test drop-empty

* fix(bedrock/converse): rename colliding local to satisfy mypy; test handler system-merge branches

* fix(types): register supports_mid_conversation_system in model-info schema

The cost-map JSON-schema validation test (test_aaamodel_prices_and_context_window_json_is_valid)
rejects unknown properties, so adding supports_mid_conversation_system to the opus-4-8
cost-map entries failed CI with 'Additional properties are not allowed'. Register the flag
in the INTENDED_SCHEMA allow-list and in the ProviderSpecificModelInfo TypedDict so it is a
typed, first-class capability flag alongside its peers (supports_output_config, etc.).

---------

Co-authored-by: Sameer Kankute <sameer@berri.ai>

* fix(bedrock/agentcore): optionally forward multimodal content blocks in InvokeAgentRuntime payload (#28885)

* fix(bedrock/agentcore): optionally forward multimodal content blocks in InvokeAgentRuntime payload

By default the agentcore provider flattens the last message to a text-only
{"prompt": "..."} payload via convert_content_list_to_str, silently dropping
OpenAI multimodal blocks (image_url, file, input_audio, ...).

This adds an opt-in `forward_multimodal_content` litellm param. When truthy and
the last message's content is a list containing a non-text block, the original
OpenAI content list is forwarded verbatim under a new "content" field so an
attachment-aware AgentCore agent can read it. Default off keeps the payload
byte-identical to the legacy {"prompt": "..."} shape — existing agents are
unaffected.

The flag is read from optional_params (where other AgentCore params land) with a
litellm_params fallback, and accepts a bool or a config/env string ('true', '1', ...).
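
The opt-in behavior described above could be sketched as follows. This is a hypothetical illustration rather than the provider's code: the helper names are made up, `_text_only` is a simplified stand-in for convert_content_list_to_str, and keeping `prompt` alongside the new `content` field is an assumption. Only the two string values the commit message spells out are accepted; the elided ones are not guessed.

```python
from typing import Any, Union

_TRUE_STRINGS = frozenset({"true", "1"})  # only the values named in the commit message


def _is_enabled(flag: Union[bool, str, None]) -> bool:
    if isinstance(flag, bool):
        return flag
    if isinstance(flag, str):
        return flag.strip().lower() in _TRUE_STRINGS
    return False


def _text_only(content: Any) -> str:
    """Simplified stand-in for flattening OpenAI content to text."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        return "".join(
            block.get("text", "")
            for block in content
            if isinstance(block, dict) and block.get("type") == "text"
        )
    return ""


def build_agentcore_payload(
    messages: list[dict[str, Any]],
    forward_multimodal_content: Union[bool, str, None] = False,
) -> dict[str, Any]:
    last_content = messages[-1].get("content")
    payload: dict[str, Any] = {"prompt": _text_only(last_content)}
    has_non_text = isinstance(last_content, list) and any(
        isinstance(block, dict) and block.get("type") != "text" for block in last_content
    )
    if _is_enabled(forward_multimodal_content) and has_non_text:
        # Shallow copy: the payload owns its list, block dicts stay shared.
        payload["content"] = list(last_content)
    return payload
```

With the flag left off, the payload keeps the legacy `{"prompt": "..."}` shape, so existing agents see no change.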

AgentCore Runtime is schemaless on the agent side — the agent's @app.entrypoint
parses arbitrary JSON up to 100 MB (per
https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-invoke-agent.html),
so this is a purely upstream change; no AgentCore-side schema is asserted.

* fix(bedrock/agentcore): shallow-copy forwarded multimodal content list

Address review feedback (Sameerlite): payload["content"] = last_content
aliased the caller's mutable messages[-1]["content"] list. Harmless today
because the payload is JSON-serialized immediately, but a latent footgun if
a future caller mutates the returned payload before serialization. Forward
list(last_content) so the payload owns its own list. Block dicts stay shared
on purpose — a deep copy would clone potentially large base64 media on the
request hot path, and the flagged risk was the shared list, not the blocks.

Update the passthrough tests to assert equality + distinct identity, and add
a regression test that mutating the payload list can't leak back into the
original message content.

* Revert "fix(mcp): preserve native tools in semantic filter hook (#26650)"

This reverts commit 438c825.

* Revert "feat(guardrails): integrate Repelloai Argus guardrail (#30465)"

This reverts commit 54da785.

* Revert "feat(dashscope): add Responses API support (#30286)"

This reverts commit 6766256.

* Revert "fix(bedrock): handle role:"system" inside the messages array on /v1/messages (#29698) (#30443)"

This reverts commit b8a8083.

* Revert "fix(anthropic): drop orphaned server_tool_use on multi-turn replay from generic OpenAI clients (#30486)"

This reverts commit 6e9c0b0.

* Revert "fix: route volcengine (Doubao) tiered-pricing models to the tiered cost handler (#30357)"

This reverts commit 172e302.

* Revert "feat(proxy): serve Anthropic-native /v1/models for Claude Code gateway discovery (#30273)"

This reverts commit 4e31885.

* fix: pass key_limit=None in team_member_update and patch model_cost in pricing test

team_member_update called team_info without key_limit, so the fastapi.Query
default object (not None) was passed through to get_data, which failed when
serializing it. Pass key_limit=None explicitly to avoid this.

test_get_model_info_costs now patches litellm.model_cost from the local backup so
the assertion holds before the PR is merged and the remote main URL is updated.

* fix(security): validate resolved model in /realtime/client_secrets for non-transcription sessions (#30710)

Omitting both model and session.model caused the endpoint to default to
gpt-4o-realtime-preview without running can_key_call_resolved_model, so
any key could access that model regardless of its allowed-model list.

The transcription path already called can_key_call_resolved_model; this
adds the same call for the realtime path before returning.

* fix(lint): fix F821 undefined model_info and F841 unused metadata in create_model_info_response

* fix: black formatting and stub get_model_group_info in third team translation test

* fix: reformat utils.py with black 26.3.1 to match CI

* fix: replace Optional[X] with X | None to satisfy UP045 ruff strict gate

---------

Co-authored-by: Habon Laszlo <habonlaci@users.noreply.github.com>
Co-authored-by: habonlaci <4699494+habonlaci@users.noreply.github.com>
Co-authored-by: Armaan Sandhu <74664101+Ar-maan05@users.noreply.github.com>
Co-authored-by: santino18727-debug <santino18727@gmail.com>
Co-authored-by: Eric (GabiDevFamily) <271972409+santino18727-debug@users.noreply.github.com>
Co-authored-by: Nitish Agarwal <1592163+nitishagar@users.noreply.github.com>
Co-authored-by: jho1-godaddy <171078705+jho1-godaddy@users.noreply.github.com>
Co-authored-by: 安妮的心动录 <74543653+anneheartrecord@users.noreply.github.com>
Co-authored-by: Harshith Gujjeti <153299927+Harshxth@users.noreply.github.com>
Co-authored-by: Tomoya Tabuchi <t@tomoyat1.com>
Co-authored-by: Vedant Agarwal <43557509+Vedant-Agarwal@users.noreply.github.com>
Co-authored-by: Prathamesh Jadhav <55660103+lollinng@users.noreply.github.com>
Co-authored-by: songkuan-zheng <252822057+songkuan-zheng@users.noreply.github.com>
Co-authored-by: Kropiunig <48442031+Kropiunig@users.noreply.github.com>
Co-authored-by: Lavish Bansal <lavish.bansal619@gmail.com>
Co-authored-by: Shane Emmons <27679+semmons99@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Anuj ojha <ojhaanuj224@gmail.com>
Co-authored-by: Nahrin <nahrin@nahrinoda.com>
Co-authored-by: Nbouyaa <67773915+FadelT@users.noreply.github.com>
Co-authored-by: Vineeth Sai <vineethsai4444@gmail.com>
Co-authored-by: Eugene Lugovtsov <34510252+EugeneLugovtsov@users.noreply.github.com>
Co-authored-by: Yevhen Luhovtsov <yevhen.luhovtsov@intapp.com>
Co-authored-by: Ayush Shekhar <106994833+ayushh0110@users.noreply.github.com>
Co-authored-by: Ahmad Shahzad <107808273+shzdehmd@users.noreply.github.com>
Co-authored-by: Kent <72616338+kingdoooo@users.noreply.github.com>
Co-authored-by: Jón Levy <levy@apro.is>
@Ar-maan05 deleted the feat-anthropic-native-v1-models branch on June 19, 2026 at 09:58

Development

Successfully merging this pull request may close these issues.

[Feature]: Support Anthropic-native response format for /v1/models endpoint (Claude Code gateway discovery)