fix(anthropic,bedrock): omit thinking/output_config when reasoning_effort="none" #27039

Merged
mateo-berri merged 2 commits into litellm_internal_staging from litellm_fix_reasoning_effort_none_anthropic
May 2, 2026

Conversation

@mateo-berri

@mateo-berri mateo-berri commented May 2, 2026

Collaborator

Linear ticket

Resolves LIT-2758

Pre-Submission checklist

  • I have added tests in tests/test_litellm/
  • My PR passes all unit tests
  • PR scope is isolated to reasoning_effort → Anthropic thinking mapping

What this PR does

Setting reasoning_effort="none" on any Anthropic-backed chat model crashed LiteLLM with:

litellm.APIConnectionError: 'NoneType' object has no attribute 'get'

AnthropicConfig._map_reasoning_effort returns None for reasoning_effort="none", but two callers were assigning that None directly to optional_params["thinking"]. The downstream is_thinking_enabled then ran optional_params["thinking"].get("type") and threw an AttributeError. This affected every provider that routes through AnthropicConfig for reasoning_effort mapping — Anthropic, Bedrock Converse, Bedrock Invoke, Vertex AI Anthropic, and Azure AI Anthropic — including the Claude 4.6 / 4.7 adaptive-thinking path the customer originally hit.
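
The failure mode can be reproduced in isolation. The sketch below uses hypothetical stand-ins (`map_reasoning_effort_stub`, `is_thinking_enabled_stub`) rather than LiteLLM's actual functions, and the dict returned for other efforts is a placeholder. It only illustrates why storing `None` under the `thinking` key breaks a later `.get()` call.

```python
from typing import Optional


def map_reasoning_effort_stub(effort: str) -> Optional[dict]:
    """Hypothetical stand-in: returns None for "none", a dict otherwise."""
    if effort == "none":
        return None
    return {"type": "enabled", "budget_tokens": 1024}  # placeholder value


def is_thinking_enabled_stub(optional_params: dict) -> bool:
    """Hypothetical stand-in for the downstream check described above."""
    # Raises AttributeError when the stored value is None.
    return optional_params["thinking"].get("type") == "enabled"


def reproduce_crash() -> str:
    """Store the mapper result unconditionally, as the buggy callers did."""
    optional_params = {}
    optional_params["thinking"] = map_reasoning_effort_stub("none")
    try:
        is_thinking_enabled_stub(optional_params)
    except AttributeError as exc:
        return str(exc)
    return "no error"


print(reproduce_crash())
```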

1. Anthropic chat transformation

litellm/llms/anthropic/chat/transformation.py

When _map_reasoning_effort returns None, pop thinking (and on Claude 4.6/4.7 also output_config) from optional_params instead of assigning None. This restores the documented contract that reasoning_effort="none" means "do not enable thinking" and avoids the NoneType crash.

This single change covers Anthropic direct, Bedrock Invoke (AmazonAnthropicClaudeConfig calls AnthropicConfig.map_openai_params directly), Vertex AI Anthropic (VertexAIAnthropicConfig calls super().map_openai_params(...)), and Azure AI Anthropic (AzureAnthropicConfig inherits map_openai_params).
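
The pop-when-None pattern described above can be sketched roughly as follows. This is a simplified illustration, not the code in `transformation.py`: `map_reasoning_effort_stub`, `apply_reasoning_effort`, and the `adaptive_model` flag are hypothetical names, and the mapped value is a placeholder.

```python
from typing import Optional


def map_reasoning_effort_stub(effort: str) -> Optional[dict]:
    """Hypothetical stand-in for the mapper; None means "no thinking"."""
    if effort == "none":
        return None
    return {"type": "adaptive"}  # placeholder value


def apply_reasoning_effort(optional_params: dict, effort: str, adaptive_model: bool) -> dict:
    """Set `thinking` only when the mapper returns a value; otherwise drop it."""
    mapped = map_reasoning_effort_stub(effort)
    if mapped is None:
        # pop() with a default is a no-op when the key is absent.
        optional_params.pop("thinking", None)
        if adaptive_model:
            # Assumption: only adaptive (4.6/4.7) models carry output_config.
            optional_params.pop("output_config", None)
    else:
        optional_params["thinking"] = mapped
    return optional_params


params = {"thinking": {"type": "adaptive"}, "output_config": {"effort": "low"}}
print(apply_reasoning_effort(params, "none", adaptive_model=True))
```

Popping with a default keeps the branch safe whether or not an earlier step already wrote either key.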

2. Bedrock Converse

litellm/llms/bedrock/chat/converse_transformation.py

AmazonConverseConfig._handle_reasoning_effort_parameter calls AnthropicConfig._map_reasoning_effort directly and assigns the result to optional_params["thinking"] itself, so it bypasses the fix in map_openai_params. This PR applies the same pop-when-None pattern there so that Bedrock Converse Anthropic models (Opus 4.5/4.6/4.7, Sonnet 4.5/4.6) are also covered.

Testing

tests/test_litellm/llms/anthropic/chat/test_anthropic_chat_transformation.py

  • test_reasoning_effort_none_omits_thinking_and_output_config (parametrized over claude-opus-4-5-20251101, claude-opus-4-6-20250514, claude-sonnet-4-6-20260219, claude-opus-4-7)

tests/test_litellm/llms/bedrock/chat/test_converse_transformation.py

  • test_reasoning_effort_none_omits_thinking_for_anthropic_converse (parametrized over bedrock/converse/us.anthropic.claude-opus-4-{5,6,7})

All 7 new tests pass and existing reasoning_effort/thinking tests in both files continue to pass.
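
The shape of these regression checks can be sketched without pytest as a loop over model IDs. The mappers below are hypothetical stand-ins, not LiteLLM code; the real tests call the provider config classes directly.

```python
from typing import Callable, Dict, List

MODELS = [
    "claude-opus-4-5-20251101",
    "claude-opus-4-6-20250514",
    "claude-sonnet-4-6-20260219",
    "claude-opus-4-7",
]


def models_leaking_fields(map_params: Callable[[str, str], Dict], models: List[str]) -> List[str]:
    """Return the models whose mapped params still carry thinking/output_config."""
    failures = []
    for model in models:
        mapped = map_params(model, "none")
        if "thinking" in mapped or "output_config" in mapped:
            failures.append(model)
    return failures


def fixed_mapper(model: str, effort: str) -> Dict:
    """Hypothetical stand-in for a mapper with the pop-when-None fix applied."""
    return {} if effort == "none" else {"thinking": {"type": "adaptive"}}


def buggy_mapper(model: str, effort: str) -> Dict:
    """Hypothetical stand-in for the pre-fix behaviour (writes thinking=None)."""
    return {"thinking": None} if effort == "none" else {"thinking": {"type": "adaptive"}}


print(models_leaking_fields(fixed_mapper, MODELS))
print(models_leaking_fields(buggy_mapper, MODELS))
```

Checking for key presence rather than truthiness matters here: `thinking=None` is falsy but still counts as a leaked field.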

Type

🐛 Bug Fix


Note

Medium Risk
Changes request-parameter mapping for Anthropic-backed models so reasoning_effort="none" removes fields instead of sending thinking=None, which could affect downstream behavior for callers relying on prior (buggy) serialization. Limited scope and covered by new unit tests across Anthropic and Bedrock Converse model variants.

Overview
Fixes reasoning_effort="none" handling for Anthropic-backed chat models by not writing thinking=None; instead it removes thinking (and for Claude 4.6/4.7 also output_config) from the outgoing request params.

Applies the same behavior in Bedrock Converse’s reasoning-effort handler, and adds targeted tests asserting the omission behavior across multiple Claude 4.5/4.6/4.7 model IDs.

Reviewed by Cursor Bugbot for commit 3835306. Bugbot is set up for automated code reviews on this repo. Configure here.

@mateo-berri

mateo-berri commented May 2, 2026

Collaborator Author

QA: end-to-end sweep for reasoning_effort mapping

This PR fixes the reasoning_effort="none" NoneType.get crash on Anthropic-backed routes. The sweep below covers all 21 (provider × model) routes across 11 effort values (231 cells), comparing the wire body of each request against the expected mapping.

Proxy config used

model_list:
  - model_name: claude-opus-4-6
    litellm_params:
      model: anthropic/claude-opus-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-opus-4-7
    litellm_params:
      model: anthropic/claude-opus-4-7
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-opus-4-5
    litellm_params:
      model: anthropic/claude-opus-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: bedrock-claude-opus-4-5
    litellm_params:
      model: bedrock/converse/us.anthropic.claude-opus-4-5-20251101-v1:0
      aws_region_name: us-east-1
  - model_name: bedrock-claude-sonnet-4-5
    litellm_params:
      model: bedrock/converse/us.anthropic.claude-sonnet-4-5-20250929-v1:0
      aws_region_name: us-east-1
  - model_name: bedrock-claude-opus-4-6
    litellm_params:
      model: bedrock/converse/us.anthropic.claude-opus-4-6-v1
      aws_region_name: us-east-1
  - model_name: bedrock-claude-sonnet-4-6
    litellm_params:
      model: bedrock/converse/us.anthropic.claude-sonnet-4-6
      aws_region_name: us-east-1
  - model_name: bedrock-claude-opus-4-7
    litellm_params:
      model: bedrock/converse/us.anthropic.claude-opus-4-7
      aws_region_name: us-east-1
  - model_name: bedrock-invoke-claude-opus-4-5
    litellm_params:
      model: bedrock/invoke/us.anthropic.claude-opus-4-5-20251101-v1:0
      aws_region_name: us-east-1
  - model_name: bedrock-invoke-claude-opus-4-6
    litellm_params:
      model: bedrock/invoke/us.anthropic.claude-opus-4-6-v1
      aws_region_name: us-east-1
  - model_name: bedrock-invoke-claude-sonnet-4-6
    litellm_params:
      model: bedrock/invoke/us.anthropic.claude-sonnet-4-6
      aws_region_name: us-east-1
  - model_name: azure-claude-opus-4-7
    litellm_params:
      model: azure_ai/claude-opus-4-7
      api_base: os.environ/AZURE_FOUNDRY_API_BASE
      api_key: os.environ/AZURE_FOUNDRY_API_KEY
  - model_name: azure-claude-sonnet-4-6
    litellm_params:
      model: azure_ai/claude-sonnet-4-6
      api_base: os.environ/AZURE_FOUNDRY_API_BASE
      api_key: os.environ/AZURE_FOUNDRY_API_KEY
  - model_name: azure-claude-haiku-4-5
    litellm_params:
      model: azure_ai/claude-haiku-4-5
      api_base: os.environ/AZURE_FOUNDRY_API_BASE
      api_key: os.environ/AZURE_FOUNDRY_API_KEY
  - model_name: vertex-claude-haiku-4-5
    litellm_params:
      model: vertex_ai/claude-haiku-4-5
      vertex_project: vertex-check-481318
      vertex_location: us-east5
  - model_name: vertex-claude-sonnet-4-6
    litellm_params:
      model: vertex_ai/claude-sonnet-4-6
      vertex_project: vertex-check-481318
      vertex_location: us-east5
  - model_name: vertex-claude-opus-4-7
    litellm_params:
      model: vertex_ai/claude-opus-4-7
      vertex_project: vertex-check-481318
      vertex_location: global
  - model_name: vertex-claude-opus-4-6
    litellm_params:
      model: vertex_ai/claude-opus-4-6
      vertex_project: vertex-check-481318
      vertex_location: us-east5
  - model_name: azure-claude-opus-4-6
    litellm_params:
      model: azure_ai/claude-opus-4-6
      api_base: os.environ/AZURE_FOUNDRY_API_BASE
      api_key: os.environ/AZURE_FOUNDRY_API_KEY
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-haiku-4-5
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  drop_params: false

Started with: litellm --config /tmp/proxy_test_config.yaml --port 4222 --detailed_debug (with this PR's branch checked out, DATABASE_URL unset).

Curl for any cell

curl -sS -X POST http://0.0.0.0:4222/v1/chat/completions \
  -H "Authorization: Bearer sk-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<alias-from-config>",
    "messages": [{"role": "user", "content": "Step by step, calculate 47 * 53. Show your work."}],
    "max_tokens": 200,
    "reasoning_effort": "<effort>"
  }'
  • max_tokens=200 for adaptive (4.6/4.7) cells; max_tokens=8192 for budget (4.5) cells (so budget_tokens<max_tokens).
  • For __omit__ rows: omit the reasoning_effort field entirely.
  • For Bedrock Invoke /v1/messages rows: change the path to /v1/messages.
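
The per-cell request body described by the curl command and notes above could be generated with a small helper like the one below. `build_payload`, the `budget_mode` flag, and the `OMIT` sentinel are illustrative names, not part of the QA harness actually used for the sweep.

```python
import json

OMIT = "__omit__"  # sentinel from the notes above: leave the field out entirely


def build_payload(alias: str, effort: str, budget_mode: bool) -> dict:
    """Build the chat-completions body for one sweep cell."""
    payload = {
        "model": alias,
        "messages": [
            {"role": "user", "content": "Step by step, calculate 47 * 53. Show your work."}
        ],
        # Budget-mode (4.5) cells need headroom so budget_tokens < max_tokens.
        "max_tokens": 8192 if budget_mode else 200,
    }
    if effort != OMIT:
        payload["reasoning_effort"] = effort
    return payload


print(json.dumps(build_payload("claude-opus-4-7", "none", budget_mode=False), indent=2))
```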

Results

Cell format (4.6/4.7): <✅|❌> status: <code>, thinking.type: <type|omit>, output_config.effort: <effort|omit>
Cell format (4.5): <✅|❌> status: <code>, max_tokens: <max>, thinking.type: <enabled|omit>, thinking.budget_tokens: <n>
Mark is ❌ when wire body or status diverges from the mapping the cell should produce (silent strip, silent garbage acceptance, knob loss, etc.).
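
As an illustration of the marking rule, the sketch below renders one adaptive (4.6/4.7) cell from an expected and an observed result. The dict keys and the `render_cell` helper are hypothetical; the actual sweep tooling is not shown in this PR.

```python
def render_cell(expected: dict, observed: dict) -> str:
    """Format one adaptive cell; the mark compares observed against expected."""
    mark = "✅" if observed == expected else "❌"
    return (
        f"{mark} status: {observed['status']}, "
        f"thinking.type: {observed['thinking_type']}, "
        f"output_config.effort: {observed['effort']}"
    )


expected = {"status": 200, "thinking_type": "adaptive", "effort": "low"}
observed = {"status": 200, "thinking_type": "adaptive", "effort": "omit"}
print(render_cell(expected, observed))
```

A cell with status 200 can still be marked ❌ under this rule, which is how the silent-strip cases in the tables are flagged.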

Anthropic direct (/v1/chat/completions)

| effort | opus-4-7 | sonnet-4-6 | haiku-4-5 |
| --- | --- | --- | --- |
| (omit) | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| none | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| minimal | ✅ status: 200, thinking.type: adaptive, output_config.effort: low | ✅ status: 200, thinking.type: adaptive, output_config.effort: low | ✅ status: 400, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 128 |
| low | ✅ status: 200, thinking.type: adaptive, output_config.effort: low | ✅ status: 200, thinking.type: adaptive, output_config.effort: low | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 1024 |
| medium | ✅ status: 200, thinking.type: adaptive, output_config.effort: medium | ✅ status: 200, thinking.type: adaptive, output_config.effort: medium | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 2048 |
| high | ✅ status: 200, thinking.type: adaptive, output_config.effort: high | ✅ status: 200, thinking.type: adaptive, output_config.effort: high | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 4096 |
| xhigh | ✅ status: 200, thinking.type: adaptive, output_config.effort: xhigh | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| max | ✅ status: 200, thinking.type: adaptive, output_config.effort: max | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| disabled | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| invalid | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| "" | ✅ status: 400, thinking.type: adaptive, output_config.effort: "" | ✅ status: 400, thinking.type: adaptive, output_config.effort: "" | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |

Azure AI Foundry (/v1/chat/completions)

| effort | opus-4-7 | opus-4-6 | sonnet-4-6 | haiku-4-5 |
| --- | --- | --- | --- | --- |
| (omit) | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| none | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| minimal | ✅ status: 200, thinking.type: adaptive, output_config.effort: low | ✅ status: 200, thinking.type: adaptive, output_config.effort: low | ✅ status: 200, thinking.type: adaptive, output_config.effort: low | ✅ status: 400, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 128 |
| low | ✅ status: 200, thinking.type: adaptive, output_config.effort: low | ✅ status: 200, thinking.type: adaptive, output_config.effort: low | ✅ status: 200, thinking.type: adaptive, output_config.effort: low | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 1024 |
| medium | ✅ status: 200, thinking.type: adaptive, output_config.effort: medium | ✅ status: 200, thinking.type: adaptive, output_config.effort: medium | ✅ status: 200, thinking.type: adaptive, output_config.effort: medium | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 2048 |
| high | ✅ status: 200, thinking.type: adaptive, output_config.effort: high | ✅ status: 200, thinking.type: adaptive, output_config.effort: high | ✅ status: 200, thinking.type: adaptive, output_config.effort: high | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 4096 |
| xhigh | ✅ status: 200, thinking.type: adaptive, output_config.effort: xhigh | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| max | ✅ status: 200, thinking.type: adaptive, output_config.effort: max | ✅ status: 200, thinking.type: adaptive, output_config.effort: max | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| disabled | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| invalid | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| "" | ✅ status: 400, thinking.type: adaptive, output_config.effort: "" | ✅ status: 400, thinking.type: adaptive, output_config.effort: "" | ✅ status: 400, thinking.type: adaptive, output_config.effort: "" | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |

Vertex AI (/v1/chat/completions)

| effort | opus-4-7 | opus-4-6 | sonnet-4-6 | haiku-4-5 |
| --- | --- | --- | --- | --- |
| (omit) | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| none | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| minimal | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 400, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 128 |
| low | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 1024 |
| medium | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 2048 |
| high | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 4096 |
| xhigh | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| max | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| disabled | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| invalid | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| "" | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |

Bedrock Converse (/v1/chat/completions)

| effort | opus-4-7 | opus-4-6 | sonnet-4-6 | sonnet-4-5 |
| --- | --- | --- | --- | --- |
| (omit) | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| none | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| minimal | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 1024 |
| low | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 1024 |
| medium | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 2048 |
| high | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 4096 |
| xhigh | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| max | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| disabled | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| invalid | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| "" | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |

Bedrock Invoke (/v1/chat/completions)

| effort | opus-4-6 | sonnet-4-6 | opus-4-5 |
| --- | --- | --- | --- |
| (omit) | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| none | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| minimal | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 400, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 128 |
| low | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 1024 |
| medium | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 2048 |
| high | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: enabled, thinking.budget_tokens: 4096 |
| xhigh | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| max | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| disabled | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| invalid | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, thinking.type: omit, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |
| "" | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ❌ status: 200, thinking.type: adaptive, output_config.effort: omit | ✅ status: 500, max_tokens: —, thinking.type: omit, thinking.budget_tokens: — |

Bedrock Invoke (/v1/messages)

| effort | opus-4-6 | sonnet-4-6 | opus-4-5 |
| --- | --- | --- | --- |
| (omit) | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| none | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| minimal | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| low | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| medium | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| high | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| xhigh | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| max | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| disabled | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| invalid | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |
| "" | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ❌ status: 200, thinking.type: omit, output_config.effort: omit | ✅ status: 200, max_tokens: 8192, thinking.type: omit, thinking.budget_tokens: — |

Bugs remaining

  1. Effort knob silently lost on Bedrock + Vertex adaptive routes. All Bedrock (Converse + Invoke /chat) and Vertex routes strip output_config.effort before wire — low/medium/high/xhigh/max all produce identical adaptive thinking with no tier differentiation. (bedrock/chat/converse_transformation.py:1197,1207-1209, bedrock/chat/invoke_transformations/anthropic_claude3_transformation.py:172, vertex_ai/vertex_ai_partner_models/anthropic/output_params_utils.py:13-50.)
  2. Vertex effort strip is unjustified. Direct :rawPredict curls to Vertex claude-opus-4-7 (global) and 4-6 (us-east5) accept output_config.effort. The strip site's source comment claiming Vertex doesn't support it is stale.
  3. Bedrock Converse silent footgun on garbage efforts. disabled/invalid/"" return 200 with adaptive thinking instead of 500 — Converse never invokes _apply_output_config validation. (bedrock/chat/converse_transformation.py:_handle_reasoning_effort_parameter.)
  4. Bedrock Invoke /chat and Vertex silently accept effort="". Empty string slips past validation and produces 200 with adaptive thinking.
  5. Bedrock Invoke /v1/messages ignores reasoning_effort entirely — allowlist filter at types/llms/bedrock.py:1002-1043 drops thinking and output_config keys; route is a no-op for reasoning.
  6. disabled/invalid/unmapped efforts produce 500 instead of 400. _map_reasoning_effort and _apply_output_config raise ValueError which surfaces as 500. Should be a clean client-side 400.
  7. minimal → budget_tokens=128 is always rejected by direct Anthropic / Azure / Vertex / Bedrock Invoke (provider min is 1024). Bedrock Converse clamps to 1024. Either bump the mapping to 1024 across the board or document the discrepancy.
  8. xhigh/max on budget-mode (4.5) models produce 500. No mapping exists; should be a clean 400 saying these tiers aren't defined for budget-mode models.
  9. sonnet-4-6 supports_max_reasoning_effort JSON entry vs runtime gating. Confirm the model_prices entry matches the actual gate (Vertex/Anthropic both reject max on sonnet-4-6 currently — gate is correct, but double-check JSON metadata is consistent).
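
For items 6 and 8, one possible direction is to validate the effort value up front and raise an error that carries a 400 status, rather than letting a bare ValueError surface as a 500. The sketch below is only an illustration: `ClientError`, `validate_effort`, and the `VALID_EFFORTS` set are hypothetical, the set is inferred from the efforts that succeeded somewhere in the sweep, and per-model gating (for example xhigh on budget-mode models) is not modelled.

```python
# Assumption: the set of efforts that mapped successfully on at least one route.
VALID_EFFORTS = {"none", "minimal", "low", "medium", "high", "xhigh", "max"}


class ClientError(Exception):
    """Hypothetical exception carrying an HTTP status code."""

    def __init__(self, status_code: int, message: str) -> None:
        super().__init__(message)
        self.status_code = status_code


def validate_effort(effort: str) -> str:
    """Reject unknown efforts as a 400 before any provider mapping runs."""
    if effort not in VALID_EFFORTS:
        raise ClientError(400, f"Unsupported reasoning_effort: {effort!r}")
    return effort


for candidate in ("low", "disabled", ""):
    try:
        print(candidate, "->", validate_effort(candidate))
    except ClientError as exc:
        print(repr(candidate), "->", exc.status_code, exc)
```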

@greptile-apps

greptile-apps Bot commented May 2, 2026

Contributor

Greptile Summary

This PR fixes a NoneType crash that occurred when reasoning_effort="none" was passed to any Anthropic-backed model. AnthropicConfig._map_reasoning_effort already returned None for that value, but both callers were assigning the result directly to optional_params["thinking"], causing is_thinking_enabled to call .get() on None. The fix introduces a guard in both the Anthropic map_openai_params and the Bedrock Converse _handle_reasoning_effort_parameter, popping thinking (and output_config on the Anthropic path) instead of assigning None.

Confidence Score: 5/5

Safe to merge — minimal, well-targeted fix with regression tests covering all affected paths.

The change is a narrow guard around an existing function's return value; the fix pattern is applied consistently across both affected callers. The post-pop call to update_optional_params_with_thinking_tokens correctly receives an optional_params dict without thinking or reasoning_effort, so is_thinking_enabled returns False and no max_tokens injection occurs. Seven new tests exercise the failure scenario end-to-end, and no existing tests are weakened.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/llms/anthropic/chat/transformation.py | Restructures the reasoning_effort branch to first capture _map_reasoning_effort's return value and pop thinking/output_config when it's None, fixing the NoneType crash for reasoning_effort="none". |
| litellm/llms/bedrock/chat/converse_transformation.py | Applies the same pop-when-None pattern to _handle_reasoning_effort_parameter, closing the parallel crash path in the Bedrock Converse handler. |
| tests/test_litellm/llms/anthropic/chat/test_anthropic_chat_transformation.py | Adds parametrized regression test covering four model variants; minor whitespace-only reformatting of two existing pytest.raises blocks does not weaken coverage. |
| tests/test_litellm/llms/bedrock/chat/test_converse_transformation.py | Adds parametrized regression test for three Bedrock Converse model variants, verifying thinking is absent when reasoning_effort="none" is passed. |

Reviews (2): Last reviewed commit: "fix(anthropic,bedrock): omit thinking/ou..." | Re-trigger Greptile

Comment thread tests/test_litellm/llms/bedrock/chat/test_converse_transformation.py Outdated
…fort="none"

Setting reasoning_effort="none" on Anthropic chat models (direct, Bedrock
Invoke, Bedrock Converse, Vertex AI Anthropic, Azure AI Anthropic) crashed
LiteLLM with:

  litellm.APIConnectionError: 'NoneType' object has no attribute 'get'

Both the Anthropic chat transformation and Bedrock Converse called
``AnthropicConfig._map_reasoning_effort`` and assigned the ``None`` it returns
for ``"none"`` directly to ``optional_params["thinking"]``. Downstream
``is_thinking_enabled`` then did ``optional_params["thinking"].get("type")``
and crashed.

Pop ``thinking`` (and on Claude 4.6/4.7, ``output_config``) instead of
assigning ``None``, restoring the documented contract that
``reasoning_effort="none"`` means "do not enable thinking". This also
prevents downstream Anthropic 400s ("thinking: Input should be an object",
"output_config.effort: Input should be ...") if the bug were ever masked.

Verified end-to-end against the live Anthropic API and Bedrock Converse
on claude-opus-4-{5,6,7} and claude-sonnet-4-6, plus Bedrock Invoke for
Claude 4.5/4.6. Vertex AI Anthropic and Azure AI Anthropic inherit the
fixed ``map_openai_params`` from ``AnthropicConfig`` and need no further
changes.
@mateo-berri mateo-berri force-pushed the litellm_fix_reasoning_effort_none_anthropic branch from 619914b to 3835306 Compare May 2, 2026 08:08
@mateo-berri

Collaborator Author

@greptileai

@mateo-berri mateo-berri requested a review from Sameerlite May 2, 2026 08:19

@Sameerlite Sameerlite left a comment

Collaborator

LGTM, just make sure cicd is passing before merging

@mateo-berri

Collaborator Author

bugbot run

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit 3835306.

…itellm_fix_reasoning_effort_none_anthropic

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
@mateo-berri mateo-berri enabled auto-merge May 2, 2026 08:40
@mateo-berri mateo-berri merged commit c94a8d6 into litellm_internal_staging May 2, 2026
96 of 102 checks passed
@mateo-berri mateo-berri deleted the litellm_fix_reasoning_effort_none_anthropic branch May 2, 2026 08:42
cursor Bot pushed a commit that referenced this pull request May 2, 2026
The original commit added a new env var
DEFAULT_REASONING_EFFORT_MINIMAL_THINKING_BUDGET_ANTHROPIC, but the
documentation validation CI (tests/documentation_tests/test_env_keys.py)
requires every os.getenv key in the LiteLLM source to have a matching
entry in the litellm-docs config_settings.md table — and that file lives
in a separate repo.

Drop the new env var and reuse the existing
DEFAULT_REASONING_EFFORT_LOW_THINKING_BUDGET=1024 constant, which is
both the LiteLLM 'low' budget and the Anthropic API minimum
(thinking.enabled.budget_tokens >= 1024). Net behavior is identical to the original commit:
reasoning_effort='minimal' on the pre-4.6 budget_tokens path now emits
budget_tokens=1024 instead of 128 — clearing the 400
'thinking.enabled.budget_tokens: Input should be greater than or equal
to 1024' on every Anthropic-backed provider.

The QA comment on PR #27039 explicitly suggested this consolidation
('the current minimal mapping would benefit from a bump to 1024
matching low').

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
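
The floor described in this commit can be sketched as a simple clamp. The commit itself reuses an existing constant rather than clamping, so `floor_thinking_budget` and `ANTHROPIC_MIN_BUDGET_TOKENS` below are hypothetical names used only to illustrate the resulting behavior.

```python
# 1024 comes from the commit message: Anthropic rejects budget_tokens below it.
ANTHROPIC_MIN_BUDGET_TOKENS = 1024


def floor_thinking_budget(requested_tokens: int) -> int:
    """Hypothetical helper: raise a too-small budget to the provider minimum."""
    return max(requested_tokens, ANTHROPIC_MIN_BUDGET_TOKENS)
```

Under this sketch, the old 'minimal' value of 128 becomes 1024, while larger budgets pass through unchanged.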
cursor Bot pushed a commit that referenced this pull request May 2, 2026
Greptile P2 on PR #27053: the docstring still cited 'effort' as an
example of an unsupported key dropped by the sanitizer, but
VERTEX_UNSUPPORTED_OUTPUT_CONFIG_KEYS is now empty (Vertex 4.6+ accepts
output_config.effort, per direct :rawPredict curls in the PR #27039 QA
matrix). Update the docstring to reflect the actual current behavior:
the helper is effectively a passthrough plus a defensive non-dict
guard, and the filtering scaffold is kept so that adding a new
Vertex-unsupported key would be a one-line frozenset edit.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
mateo-berri added a commit that referenced this pull request May 3, 2026
…garbage reasoning_effort

Follow-up bugs surfaced by the QA sweep on PR #27039
(#27039 (comment)).

1. Stop stripping output_config.effort on Bedrock + Vertex adaptive routes.
   - Vertex AI Claude 4.6/4.7 accepts output_config.effort on rawPredict
     (verified end-to-end against us-east5 / global). The strip helper now
     no-ops for effort.
   - Bedrock Converse routes output_config into additionalModelRequestFields
     for anthropic base models so the requested adaptive tier (low/medium/
     high/xhigh/max) actually reaches the wire instead of all collapsing to
     identical thinking.
   - Bedrock Invoke chat transformation (AmazonAnthropicClaudeConfig) stops
     popping output_config from the post-AnthropicConfig request body.
   - Bedrock Invoke /v1/messages allowlist (BedrockInvokeAnthropicMessagesRequest)
     now lists output_config so the runtime allowlist filter forwards it.

2. Validate effort across Bedrock Converse so 'disabled' / 'invalid' / '' /
   unsupported tiers (xhigh/max on Sonnet 4.6 or budget-mode 4.5 models)
   surface as a clean 400 BadRequestError instead of 500.

3. ValueError -> BadRequestError throughout (AnthropicConfig.map_openai_params,
   _apply_output_config, AmazonConverseConfig._handle_reasoning_effort_parameter).
   Empty-string effort is now rejected (was silently passing the
   'if effort and ...' short-circuit).

4. Floor reasoning_effort='minimal' at the Anthropic provider minimum
   (1024 budget_tokens) via new ANTHROPIC_MIN_THINKING_BUDGET_TOKENS so it's
   a usable tier on direct Anthropic / Azure AI Anthropic / Vertex AI Anthropic /
   Bedrock Invoke (all of which 400 below 1024).

5. model_prices: dedupe duplicate supports_max_reasoning_effort key on
   claude-opus-4-7 / claude-opus-4-7-20260416.

Adds regression tests across all five affected paths; existing tests asserting
the silent-strip behavior were updated to reflect the new pass-through and
clean 400 surfaces.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
yuneng-berri added a commit that referenced this pull request May 5, 2026
* default requested_model to empty string on litellm-side rejects

* Update litellm/router.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix: scope key access_group_ids override by team's assigned groups

A team member could set any access_group_ids on their key (e.g. a group
assigned only to a different team) and override the team's model
restriction. Intersect the key's access_group_ids with team_object.access_group_ids
in _key_access_group_grants_model so foreign groups are dropped before
model expansion. Adds a regression test that asserts expansion is never
called for foreign groups.

* [Fix] Proxy: Skip Personal Budget Hook When Reservation Covers Counter

The reservation path (PR #26845) atomically pre-fills `spend:user:{user_id}`
and admits at the strict-`<` boundary. The legacy `_PROXY_MaxBudgetLimiter`
pre-call hook re-reads the same counter with `>=`, so a reservation that
fills the counter to exactly `max_budget` (e.g. a request without a
`max_tokens` cap that falls back to reserving the smallest remaining
headroom) is rejected by the hook even though the reservation already
admitted it.

Skip the hook when the request's active `budget_reservation` covers
`spend:user:{user_id}`. The reservation is the source of truth for that
counter cross-pod; the legacy `>=` path remains in place for requests
without a reservation (e.g. paths that bypass the reservation entirely).

Reproduces as `tests/otel_tests/test_prometheus.py::test_user_budget_metrics`
on a fresh user with `max_budget=10` calling `fake-openai-endpoint` without
`max_tokens`. Adds focused unit coverage in
`tests/test_litellm/proxy/hooks/test_max_budget_limiter.py`.
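
The boundary mismatch can be reproduced with a small sketch. This is one reading of the commit message, not the proxy's code: `try_reserve`, `legacy_hook_rejects`, and `should_run_legacy_hook` are hypothetical names, and the reservation is assumed to fill all remaining headroom when the request has no `max_tokens` cap.

```python
from typing import Optional


def try_reserve(counter: float, max_budget: float) -> Optional[float]:
    """Admit only while spend is strictly below budget; return the new counter."""
    if counter < max_budget:
        # Assumption: an uncapped request reserves whatever headroom is left.
        return max_budget
    return None


def legacy_hook_rejects(counter: float, max_budget: float) -> bool:
    """Legacy pre-call check, which uses >= on the same counter."""
    return counter >= max_budget


def should_run_legacy_hook(reservation_covers_counter: bool) -> bool:
    """Skip the legacy check when a reservation already owns the counter."""
    return not reservation_covers_counter
```

A fresh user with `max_budget=10` is admitted by `try_reserve(0.0, 10.0)`, yet `legacy_hook_rejects(10.0, 10.0)` is true for the counter it just filled, which is the double-check the skip avoids.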

* harden bedrock file bucket validation

* Fix syntax errors from botched merge in router.py

* Fix Vertex batch output edge cases

* [Fix] RBAC: Drop management_routes Write Fallback for Admin Viewer

Greptile P1: the unsafe-method branch of `_check_proxy_admin_viewer_access`
ended with a blanket `if route in management_routes: return`. That set is a
mix of reads (info/list — handled via the safe-method GET branch above) and
writes. The fallback let Admin Viewer POST to write endpoints not enumerated
in `_ADMIN_VIEWER_BLOCKED_WRITE_ROUTES`, including:
  - /team/block, /team/unblock, /team/permissions_update
  - /jwt/key/mapping/{new,update,delete}
  - /key/bulk_update
  - /key/{key_id}/reset_spend

Remove the fallback. The two remaining allow sets (admin_viewer_routes and
global_spend_tracking_routes) are both read-only, so removal does not affect
the legitimate POST-as-read cases (e.g. /spend/calculate, which is in
spend_tracking_routes ⊂ admin_viewer_routes).

Tests:
  - 8 new parametrized cases pinning each previously-leaking management write
    endpoint to 403 on POST for PROXY_ADMIN_VIEW_ONLY.

* fix(tests): anchor VCR redis cassette key to repo root

`os.path.relpath` with no `start` arg uses the current working
directory, so running pytest from a subdirectory produced a
different Redis key than running from the repo root. CI-recorded
cassettes and locally-replayed runs would silently miss each
other's cache.

Anchor the path to the repo root (derived from `__file__`) so the
key is stable regardless of CWD.

https://claude.ai/code/session_018uCx7pcrkdUJZrCVMaTdPx
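
A minimal sketch of the anchoring idea, assuming a hypothetical `cassette_key` helper. The commit derives the repo root from `__file__`; here it is passed in explicitly so the example stays self-contained.

```python
import os


def cassette_key(test_file: str, repo_root: str) -> str:
    """Hypothetical key builder: path relative to a fixed root, not the CWD."""
    # Passing start= explicitly means the result no longer depends on
    # whichever directory pytest happened to be launched from.
    return os.path.relpath(os.path.abspath(test_file), start=repo_root)
```

Because `start` is fixed, the same absolute test path always yields the same key.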

* fix: gate key access_group override on group's own assignment

Replaces the previous intersect-with-team.access_group_ids check, which
made the override unreachable in practice (the team-gate fallback already
covered every case the intersection allowed). The override now resolves
each of the key's access_group_ids via get_access_object and accepts the
group only if its assigned_team_ids includes the key's team_id, or its
assigned_key_ids includes the key's token. This fulfills the original ask
(a key can extend a team's allow-list via a group the admin granted to
that team or that specific key) while still rejecting foreign groups
referenced by team members of other teams.
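
The acceptance rule can be sketched as below. The field names follow the commit message; the `AccessGroup` dataclass and `group_applies_to_key` are hypothetical and do not reflect the proxy's actual object model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AccessGroup:
    """Hypothetical shape; field names are taken from the commit message."""

    assigned_team_ids: List[str] = field(default_factory=list)
    assigned_key_ids: List[str] = field(default_factory=list)


def group_applies_to_key(group: AccessGroup, team_id: str, key_token: str) -> bool:
    """Accept a group only if it was granted to the key's team or to the key itself."""
    return team_id in group.assigned_team_ids or key_token in group.assigned_key_ids
```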

* [Fix] Proxy/Key Management: Honor team_member_permissions /key/list In /key/list Endpoint

When a team grants /key/list via team_member_permissions, non-admin members
should see all keys for that team — same as a team admin. Previously the
classification in list_keys() only checked admin status, so permitted
members fell into the service-account-only path and could not see other
members' personal keys. Route those members into the full-visibility set.

* Fix access-group bypass via litellm-model fallback path

When _get_all_deployments returns 0 candidates and the litellm-model
fallback branch (_get_deployment_by_litellm_model) finds deployments that
the access-group filter then empties, _access_group_filter_emptied_candidates
remained False (it was captured before that branch ran). The router would
then proceed to default fallbacks; the fallback model could have no
access_groups and short-circuit the filter, silently serving a caller
blocked by access-group restrictions.

Update the flag inside the litellm-model branch when filtering empties a
non-empty candidate set so the default-fallback guard still triggers.

* fix(proxy): redact MCP server URL and headers for non-admin viewers (VERIA-8)

Many MCP integrations (Zapier, etc.) embed an upstream API key
directly in the server URL, e.g.
``https://actions.zapier.com/mcp/<api-key>/sse``. The list and
single-server endpoints were returning the full URL to any
authenticated user — `_redact_mcp_credentials` only stripped the
explicit ``credentials`` field, and `_sanitize_mcp_server_for_virtual_key`
only ran for restricted virtual keys. Non-admin internal users could
read the dashboard, click the unmask toggle, and exfiltrate the raw
token.

Add `_sanitize_mcp_server_for_non_admin` that runs on top of the
existing credential redaction and clears the credential-bearing
fields:

- ``url`` (the primary leak vector)
- ``spec_path`` (OpenAPI spec URLs that may carry tokens)
- ``static_headers`` / ``extra_headers`` (Authorization)
- ``env`` (arbitrary secrets)
- ``authorization_url`` / ``token_url`` / ``registration_url``

Identity fields (``server_id``, ``alias``, ``mcp_info``, etc.) are
preserved so the UI can still list servers a non-admin's team has
access to.

Apply the new sanitizer in `fetch_all_mcp_servers` and the per-server
fetch path right after the existing virtual-key branch. Update the
existing `test_list_mcp_servers_non_admin_user_filtered` assertions
that previously checked URL visibility.

Frontend defense-in-depth: hide the URL unmask toggle on
`mcp_server_view.tsx` unless the viewer is a proxy admin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
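
A sanitizer along these lines might look like the sketch below. The field list is copied from the commit message; `sanitize_for_non_admin` and the dict-based server shape are assumptions for illustration, not the actual `_sanitize_mcp_server_for_non_admin`.

```python
from typing import Any, Dict

# Credential-bearing fields named in the commit message.
_CREDENTIAL_BEARING_FIELDS = (
    "url",
    "spec_path",
    "static_headers",
    "extra_headers",
    "env",
    "authorization_url",
    "token_url",
    "registration_url",
)


def sanitize_for_non_admin(server: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical sketch: return a copy with credential-bearing fields cleared."""
    redacted = dict(server)  # shallow copy so the caller's dict is untouched
    for name in _CREDENTIAL_BEARING_FIELDS:
        if name in redacted:
            redacted[name] = None
    return redacted
```

Identity fields such as `server_id` and `alias` are not in the list, so they pass through unchanged.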

* Fix runtime policy attachment initialization

Mark runtime-created policies and attachments initialized so global policy attachments created from the policy builder apply immediately without requiring a restart.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(router): cover _try_early_resolve_deployments_for_model_not_in_names

The router_code_coverage CI check requires every function in router.py to
be referenced by at least one test under tests/{local_testing,
router_unit_tests,test_litellm} in a file with "router" in its name.
The recently-extracted helper had no direct test, so the check failed
with "0.45% of functions in router.py are not tested".

Add a focused test that exercises the four return paths: model already
in self.model_names, no fallback applies, pattern-router match, and
default_deployment substitution (also asserting the stored default
isn't mutated).

https://claude.ai/code/session_019AVp1XL7RT9RxRe4qRLkay

* Fix policy registry teardown in tests

Reset the policy ID index during policy engine test cleanup so stale policy versions cannot leak between tests.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(batches): count non-chat tokens, validate batch-file model access (VERIA-39) (#27015)

* fix(batches): count non-chat tokens and validate every model in batch file

Two security control bypasses on POST /v1/batches:

1. `_get_batch_job_input_file_usage` only summed tokens for
   `body.messages` (chat completions). Embedding (`input`) and text
   completion (`prompt`) batches reported zero, letting massive
   non-chat workloads slip past TPM rate limits. Extend the counter
   to handle string and list shapes for both fields.

2. The batch input file was forwarded to the upstream provider
   without inspecting the models named inside the JSONL — only the
   outer `model` query parameter was checked against the caller's
   allowlist. A caller restricted to gpt-3.5 could submit a batch
   targeting gpt-4o and the upstream would execute it under the
   proxy's shared API key.

Add `_get_models_from_batch_input_file_content` (returns the
distinct `body.model` values) and call it from
`_enforce_batch_file_model_access` in the pre-call hook, which runs
each model through `can_key_call_model` so the same allowlist
semantics (wildcards, access groups, all-proxy-models, team aliases)
the proxy enforces on `/chat/completions` apply here too. Any
unauthorized model raises a 403 before the file is forwarded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(batches): count pre-tokenized prompt/input shapes, classify 403 logs

Two follow-ups from the Greptile review on the batch validation PR:

1. P1 TPM bypass via integer token arrays. The OpenAI batch schema
   accepts ``prompt`` and ``input`` as ``list[int]`` (a single
   pre-tokenized prompt) or ``list[list[int]]`` (multiple) in addition
   to the string and ``list[str]`` shapes. Pre-fix only the string
   shapes were counted, so a caller could submit a batch with hundreds
   of millions of pre-tokenized tokens and the rate limiter would
   record zero. Extract the per-field logic into
   ``_count_prompt_or_input_tokens`` and count each int as one token.

2. P2 access-denial logs were indistinguishable from I/O failures.
   ``count_input_file_usage`` caught every exception under a generic
   "Error counting input file usage" message, so an intentional 403
   from ``_enforce_batch_file_model_access`` looked the same in the
   logs as a missing file or a Prisma timeout. Catch ``HTTPException``
   separately and log 403s at WARNING level with a security-relevant
   message before re-raising.

Tests cover the new shapes: single ``list[int]``, ``list[list[int]]``
(the worst-case bypass vector), and embeddings ``input`` with
pre-tokenized arrays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
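
The four `prompt` / `input` shapes can be counted with a sketch like the one below. It is not the actual `_count_prompt_or_input_tokens`: the function name is borrowed without the underscore, and the text tokenizer is passed in as a hypothetical `count_text` callable.

```python
from typing import Any, Callable


def count_prompt_or_input_tokens(value: Any, count_text: Callable[[str], int]) -> int:
    """Count tokens for str, list[str], list[int], and list[list[int]] shapes."""
    if isinstance(value, str):
        return count_text(value)
    if not isinstance(value, list):
        return 0
    total = 0
    for item in value:
        if isinstance(item, str):
            total += count_text(item)
        elif isinstance(item, int):
            total += 1  # a pre-tokenized prompt: each int is one token
        elif isinstance(item, list):
            total += sum(1 for tok in item if isinstance(tok, int))
    return total
```

For example, with a whitespace splitter standing in for a real tokenizer, `[[1, 2], [3]]` counts as three tokens rather than zero.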

* fix(proxy): re-validate user_id after /user/info re-parses query (#27009)

* fix(proxy): re-validate user_id ownership after /user/info re-parses query

The route-level access check in `RouteChecks.non_proxy_admin_allowed_routes_check`
reads `request.query_params.get("user_id")`, which decodes literal `+` to
spaces. The endpoint then re-parses the raw query string with `urllib.unquote`
in `get_user_id_from_request` to preserve `+` characters (so plus-addressed
emails work as user_ids). Those two paths produce different ids: a caller
who registered a user_id containing a literal space could pass the route
check and then read another user's row by sending the encoded `+` form.

Add `_enforce_user_info_access` and call it after `_normalize_user_info_user_id`
returns the final id. Proxy admin / view-only admin still bypass; everyone
else must match the resolved user_id (or have no user_id, which falls back
to the caller's own id later in the handler).

Tests cover the admin bypass, owner-match path, and the cross-user lookup
that this change blocks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
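
The decoding mismatch can be demonstrated with the standard library. `parse_qs` is used here as a stand-in for `request.query_params` (an assumption; both apply form-style decoding), and `is_allowed` is a hypothetical simplification of the post-decode re-check.

```python
from typing import Optional
from urllib.parse import parse_qs, unquote

raw_query = "user_id=alice+admin@example.com"

# Form-style parsing turns "+" into a space.
route_check_id = parse_qs(raw_query)["user_id"][0]

# unquote() only decodes %XX escapes and leaves "+" alone.
endpoint_id = unquote(raw_query.split("=", 1)[1])


def is_allowed(
    resolved_user_id: Optional[str], caller_user_id: str, is_proxy_admin: bool
) -> bool:
    """Hypothetical re-check applied to the id the endpoint will actually use."""
    if is_proxy_admin or resolved_user_id is None:
        return True
    return resolved_user_id == caller_user_id
```

The two ids differ for the same request, so a check on one does not protect a lookup on the other; re-checking the resolved id closes that gap.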

* fix(proxy): apply user_info ownership check to PROXY_ADMIN_VIEW_ONLY

`_enforce_user_info_access` was bypassing both PROXY_ADMIN and
PROXY_ADMIN_VIEW_ONLY, but the upstream route check in
`RouteChecks.non_proxy_admin_allowed_routes_check` only treats
PROXY_ADMIN as a true admin for the `/user/info` route — view-only
admins go through the `user_id == valid_token.user_id` enforcement
along with regular users. Mirroring that asymmetry left the same
encoded-`+` bypass open for view-only admins whose user_id contains a
literal space.

Drop the PROXY_ADMIN_VIEW_ONLY exemption so the post-decode re-check
matches the upstream rule. Update tests: a view-only admin must now
be blocked from cross-user lookups but still allowed to read their
own row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(spend-logs): opt-in suppression of stack traces in spend-tracking error logs

Adds LITELLM_SUPPRESS_SPEND_LOG_TRACEBACKS env var. When set to true and the
proxy log level is INFO or above, spend-tracking error paths emit a single
ERROR line without the full traceback. Stack traces are preserved at DEBUG
and the Sentry / proxy_logging_obj.failure_handler path is unchanged.

The new spend_log_error helper is wired through the spend write hot path:
  - DBSpendUpdateWriter (update_database, _update_*_db, batch upsert,
    redis-commit fallbacks)
  - _ProxyDBLogger._PROXY_track_cost_callback
  - get_logging_payload exception path
  - update_spend / update_daily_tag_spend / spend logs queue monitor

Resolves LIT-2704.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(spend-logs): preserve no-traceback behavior for update_daily_tag_spend

This call site previously logged a single-line error via verbose_proxy_logger.error()
with no traceback. Switching it to spend_log_error(..., exc=e) caused a full stack
trace to render by default (when LITELLM_SUPPRESS_SPEND_LOG_TRACEBACKS is unset),
which contradicts the PR goal of leaving default behavior unchanged. Revert this
specific site to the original error log call.

* fix(spend-logs): preserve no-traceback behavior for update_daily_tag_spend

Bugbot caught a regression: the previous error log here was a single-line
verbose_proxy_logger.error(...) with no traceback. spend_log_error attaches
the active exception's traceback by default (when the suppression env var
is unset), so swapping it in changed default behavior. Revert this one site
to its original .error() call to keep the PR strictly opt-in.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(spend-logs): suppress traceback in SpendLogs error_information row

Extend LITELLM_SUPPRESS_SPEND_LOG_TRACEBACKS to the failure callback so the
per-row Metadata pane in the UI no longer shows the stack trace when the
opt-in env var is set, matching the existing console-side suppression.

https://claude.ai/code/session_014dztoRbRnRvq54HL9EyHx6

* [Fix] Proxy: Repair Merge Fallout In Router-Override Fallback Auth

Conflict resolution for #26968 dropped the `Iterator` typing import
(NameError at module load), left a dead `fallback_models = cast(...)`
block, and the new tests called `_enforce_key_and_fallback_model_access`
without the now-required `request` kwarg.

* isolate dual OTEL handlers

* harden cloud file compatibility path

* harden cloud file compatibility path

* [Fix] Proxy/Key Management: Align Key-Org Membership Checks On Generate And Regenerate

Mirrors the membership rule on /key/update so that /key/generate and
/key/{key}/regenerate apply the same `_validate_caller_can_assign_key_org`
gate when the caller specifies an `organization_id`. Proxy admins bypass.
The check no-ops when `organization_id` is not being set.

* thread trusted params through vertex file content

* trust only server legacy file flag

* chore(proxy): keep public AI hub unauthenticated

* fix(proxy): preserve low-detail readiness status

* [Test] Anthropic: Replace Legacy Claude-4-Sonnet Alias With Haiku 4.5

Three live-API tests pinned to claude-4-sonnet-20250514, which is a
non-canonical alias of claude-sonnet-4-20250514. Anthropic's main API
no longer resolves the legacy form under freshly issued keys, so the
tests fail with not_found_error. The token counter test pinned to
claude-sonnet-4-20250514 itself (deprecation_date 2026-05-14, two weeks
out) was on borrowed time too.

Bump all four to claude-haiku-4-5-20251001 — capability superset for what
these tests exercise (streaming, parallel tool calling, extended thinking,
token counting), no upcoming deprecation, cheaper per-token.

* chore(proxy): move URL-valued model/file_id guard from SDK to proxy

The previous per-provider guards in HuggingFace, Oobabooga, and Gemini
files lived in the SDK layer, breaking SDK callers who legitimately pass
URL-valued model identifiers. Move the check to the proxy boundary in
add_litellm_data_to_request so SDK users keep working while proxy users
default-deny URL-valued model and file_id, with admin opt-in via
litellm.provider_url_destination_allowed_hosts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [Chore] Proxy/UI: Drop stray _experimental/out/chat/index.html

This file is a regenerable UI build artifact that should not be tracked
in source. Removing so the merge into litellm_internal_staging stays clean.

* [Test] Anthropic Passthrough: Bump Streaming Cost-Injection Test To Haiku 4.5

test_anthropic_messages_streaming_cost_injection hits the proxy's
/v1/messages route, which routes via the anthropic/* wildcard to
api.anthropic.com. The 404 surfaced in the test was Anthropic's own
not_found_error propagated back through the proxy (visible from the
x-litellm-model-id hash on the response — the proxy did route).

Same root cause as the prior commit: the legacy claude-4-sonnet-20250514
alias is no longer recognized by Anthropic's main API under the new key.
Swap to claude-haiku-4-5-20251001 — same routing path, canonical model.

* fix(proxy): handle ownership-recording failures after upstream create

If record_container_owner raises after the upstream container is created,
the user previously got a 500 with no usable container — they were billed
for an unreachable resource. Move ownership recording into the create
path's exception handling and split the two failure modes:

- HTTPException from the recorder (auth conflicts) propagates verbatim
  so the client sees the real status code, not a generic LLM error.
- Unexpected exceptions are logged and swallowed; the response is
  returned to the caller so they aren't billed for a container they
  can't address. The DB row stays untracked until an operator reconciles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(guardrails): close post-call coverage gaps

* fix(types): add /team/permissions_bulk_update to management_routes

The blocklist check in _check_proxy_admin_viewer_access only fires for
routes that match LiteLLMRoutes.management_routes — the bulk-update
endpoint was missing from that list, so the test for view-only admins
on /team/permissions_bulk_update fell through to "allow."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [Test] Anthropic Passthrough: Bump Thinking Tests Off Legacy Sonnet 4 Alias

base_anthropic_messages_test.test_anthropic_messages_with_thinking and
test_anthropic_streaming_with_thinking still pinned to
claude-4-sonnet-20250514 — the same legacy alias Anthropic no longer
recognizes under freshly issued keys. The other four tests in this base
class already use claude-sonnet-4-5-20250929; these two were missed.

Bump to claude-haiku-4-5-20251001 (supports_reasoning=true, no upcoming
deprecation). Subclasses including TestAnthropicPassthroughBasic
inherit these methods.

* fix(guardrails): cover multi-choice output variants

* fix(proxy): preserve public ai hub ui setting

* fix(scim): cascade FK cleanup on user delete and surface block status in UI

SCIM DELETE /Users/{id} previously called litellm_usertable.delete without
clearing rows that FK back to the user, so Postgres rejected the delete with
LiteLLM_InvitationLink_user_id_fkey and the SCIM caller saw a 500. Add a
helper to drop invitation_link, organization_membership, and team_membership
rows before the user delete (mirrors /user/delete in internal_user_endpoints).

Also add a Status column to the Virtual Keys and Internal Users tables so
admins can see at a glance which keys are blocked and which users SCIM has
deactivated. SCIM-blocked keys carry a tooltip explaining the origin.

Pin the dashboard's Node version to 20 via .nvmrc to match CI.

* chore: update Next.js build artifacts (2026-05-02 03:21 UTC, node v20.20.2)

* perf(proxy): cache container/skill ownership reads on the hot path

Container ownership and skill rows are looked up on every retrieve /
delete / list / file-content / chat-completion-with-skill call. The new
stores wrapped raw Prisma queries with no cache, putting one DB
round-trip on each request. Add an in-process TTL'd cache mirroring the
_byok_cred_cache pattern in mcp_server/server.py: per-key (value,
monotonic_timestamp), 60s TTL, 10000-entry cap with full-clear on
overflow, invalidated by every write. Negative results (`None`) are
cached too so untracked-resource checks also skip the DB.

Tests cover: cache-after-first-hit, negative caching, write
invalidation, no-caching-on-DB-error, TTL expiry, capacity eviction.
56 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
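
A cache with the properties listed above (per-key monotonic timestamp, TTL, entry cap with full clear, negative caching, explicit invalidation) might be sketched like this. `TTLCache` and its method names are hypothetical; the 60s and 10000 defaults come from the commit message.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Hypothetical in-process cache; not the proxy's actual store."""

    def __init__(
        self,
        ttl_seconds: float = 60.0,
        max_entries: int = 10_000,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._store: Dict[Hashable, Tuple[Any, float]] = {}

    def get(self, key: Hashable) -> Tuple[bool, Any]:
        """Return (hit, value) so that a cached None is distinguishable from a miss."""
        entry = self._store.get(key)
        if entry is None:
            return (False, None)
        value, stored_at = entry
        if self._clock() - stored_at >= self._ttl:
            del self._store[key]
            return (False, None)
        return (True, value)

    def set(self, key: Hashable, value: Any) -> None:
        # Full clear on overflow keeps eviction logic trivial.
        if key not in self._store and len(self._store) >= self._max:
            self._store.clear()
        self._store[key] = (value, self._clock())

    def invalidate(self, key: Hashable) -> None:
        """Call on every write so readers never see a stale owner."""
        self._store.pop(key, None)
```

Returning a `(hit, value)` pair is one way to support negative caching; a sentinel object would work equally well.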

* chore: update Next.js build artifacts (2026-05-02 03:39 UTC, node v20.20.2)

* fix: remove traceback key instead of it being ""

* fix: linting error

* fix(scim): preserve scim_active on PUT when client omits the field

A SCIM PUT may legally omit `active` (full-replace with the field
absent). Pydantic fills the SCIMUser.active default of True, so the PUT
handler was overwriting metadata.scim_active with True even when the
client never sent it — silently reactivating a previously SCIM-blocked
user and unblocking their keys.

Use model_fields_set to detect whether the client actually sent
`active`. If omitted, preserve the prior scim_active value and skip
the cascade to virtual keys.

Also drop comments added in this PR that just narrate what the code
does; keep only the docstrings and the SQL-NULL pitfall note that
explain non-obvious behaviour.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(proxy): use set lookup for permitted agent filters

* fix(mcp): redact command fields for non-admin server views

* fix(proxy): forward decoded container ids after ownership checks

* fix(caching): handle stale isolated Redis semantic index

* fix(cloudflare): support response_text in streaming chunk parser

Newer Cloudflare Workers AI models (e.g. Nemotron) emit 'response_text'
instead of 'response' on streamed chunks. The non-streaming path was
already updated to fall back to 'response_text' (#26385), but the
streaming chunk parser still only read 'response', which caused
streaming requests against those models to silently produce empty
content.

Mirror the non-streaming fallback in CloudflareChatResponseIterator.chunk_parser
and add a streaming test for the response_text shape.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
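
The fallback amounts to reading one key and, if it is absent, the other. A minimal sketch, assuming a hypothetical `extract_chunk_text` helper operating on an already-parsed chunk dict (the real parser lives in `CloudflareChatResponseIterator.chunk_parser`):

```python
from typing import Any, Dict


def extract_chunk_text(chunk: Dict[str, Any]) -> str:
    """Hypothetical helper: prefer 'response', fall back to 'response_text'."""
    text = chunk.get("response")
    if text is None:
        text = chunk.get("response_text")
    return text or ""
```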

* Fix code qa

* Address bugbot: drop dead encode/decode helpers; preserve empty custom_id

- Remove unused _encode_gcp_label_value / _decode_gcp_label_value singular
  helpers; only the _chunks variants are actually called.
- Use 'is not None' check for custom_id so empty-string custom_ids are
  still labeled and round-trip through batch outputs.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* Forward Vertex file content logging context

* test vertex file content logging forwarding

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* Fix Vertex batch output logging mutation

* fix: don't mutate caller's logging_obj in _try_transform_vertex_batch_output_to_openai

The method was overwriting logging_obj.optional_params, logging_obj.model,
and logging_obj.start_time on the caller's Logging instance. When invoked
from llm_http_handler.py's generic framework path, the framework's own
logging_obj (which already went through pre_call) had its properties
clobbered, causing model and start_time to reflect the last batch line's
values rather than the original call context.

Fix: create a fresh local Logging instance for the per-line transformation
instead of mutating the incoming logging_obj. The caller's object is now
left entirely untouched regardless of whether a logging_obj was passed in
or not.

Regression tests added to verify model, start_time, and optional_params
are not mutated on the caller's logging_obj.

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* feat: add opt-out flag for Vertex batch output transformation

Adds litellm.disable_vertex_batch_output_transformation (default False).
When True, afile_content returns raw Vertex predictions.jsonl untouched
so users that parse candidates/modelVersion directly are not broken.

* fix(anthropic,bedrock): omit thinking/output_config when reasoning_effort="none"

Setting reasoning_effort="none" on Anthropic chat models (direct, Bedrock
Invoke, Bedrock Converse, Vertex AI Anthropic, Azure AI Anthropic) crashed
LiteLLM with:

  litellm.APIConnectionError: 'NoneType' object has no attribute 'get'

Both the Anthropic chat transformation and Bedrock Converse called
``AnthropicConfig._map_reasoning_effort`` and assigned the ``None`` it returns
for ``"none"`` directly to ``optional_params["thinking"]``. Downstream
``is_thinking_enabled`` then did ``optional_params["thinking"].get("type")``
and crashed.

Pop ``thinking`` (and on Claude 4.6/4.7, ``output_config``) instead of
assigning ``None``, restoring the documented contract that
``reasoning_effort="none"`` means "do not enable thinking". This also
prevents downstream Anthropic 400s ("thinking: Input should be an object",
"output_config.effort: Input should be ...") if the bug were ever masked.

Verified end-to-end against the live Anthropic API and Bedrock Converse
on claude-opus-4-{5,6,7} and claude-sonnet-4-6, plus Bedrock Invoke for
Claude 4.5/4.6. Vertex AI Anthropic and Azure AI Anthropic inherit the
fixed ``map_openai_params`` from ``AnthropicConfig`` and need no further
changes.

* fix(vertex-ai): set response=null on batch error entries per OpenAI spec

The Vertex batch output transformer was emitting both a populated 'response' and 'error' for failed batch entries. The OpenAI Batch output spec defines them as mutually exclusive: on error 'response' MUST be null. This broke any consumer using 'result["response"] is None' to detect failures.
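
The mutual-exclusion rule can be sketched as follows. `build_batch_output_line` is a hypothetical helper, not the Vertex transformer; only the `custom_id`, `response`, and `error` keys of an output line are modeled.

```python
from typing import Any, Dict, Optional


def build_batch_output_line(
    custom_id: str,
    response: Optional[Dict[str, Any]] = None,
    error: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Emit either a response or an error for one batch entry, never both."""
    if error is not None:
        # On failure the response is forced to None, even if one was supplied.
        return {"custom_id": custom_id, "response": None, "error": error}
    return {"custom_id": custom_id, "response": response, "error": None}
```

A consumer can then rely on `line["response"] is None` to detect failed entries.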

* test(vertex-ai): cover transformation_error path emits response=null

* fix(security): sandbox jinja2 in gitlab/arize/bitbucket prompt managers

DotpromptManager was hardened to render through
ImmutableSandboxedEnvironment. The three sibling managers (gitlab,
arize, bitbucket) were missed and still instantiate plain
jinja2.Environment(), leaving the same attribute-traversal SSTI
primitive open: a template fetched from a GitLab/BitBucket repo or
Arize Phoenix workspace can reach __class__.__init__.__globals__ and
execute arbitrary Python on the proxy host.

Match the dotprompt pattern by switching all three to
ImmutableSandboxedEnvironment. The sandbox blocks the dunder-traversal
chain while leaving normal {{ var }} substitution intact, so the
template surface is unchanged for legitimate use.

Adds tests/test_litellm/integrations/test_prompt_manager_ssti.py
(18 cases) verifying each manager's jinja_env is a sandbox, that
classic SSTI payloads raise SecurityError, and that ordinary variable
rendering still works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(proxy): drop client-supplied pricing fields from request bodies

The proxy currently forwards request-body pricing parameters (the fields
on `CustomPricingLiteLLMParams`, plus `metadata.model_info`) into the
core call path. Those fields belong to deployment configuration, not to
per-request input — sending them from a client mutates the request's
recorded cost and, via `litellm.completion` → `register_model`, the
process-wide `litellm.model_cost` map for every later caller in the
worker. Strip them at the boundary.

The strip set is built from `CustomPricingLiteLLMParams.model_fields` so
pricing fields added later are covered automatically. Operators who do
want clients to supply per-request pricing can opt back in per key or
team via `metadata.allow_client_pricing_override = true`, mirroring the
existing `allow_client_mock_response` and
`allow_client_message_redaction_opt_out` flags.

Tests cover the strip set's coverage, root and metadata strips, the
opt-in skip on both key and team metadata, and a regression check that
the global `litellm.model_cost` map is unmutated after a stripped
request.
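
The strip-with-opt-in behaviour described above might look roughly like this. It is a sketch under stated assumptions: ``strip_client_pricing`` is a hypothetical name, and ``PRICING_FIELDS`` is a hand-written illustrative subset rather than the set derived from ``CustomPricingLiteLLMParams.model_fields``.

```python
from typing import Any, Dict, List

# Illustrative subset only; the commit builds the real set from
# CustomPricingLiteLLMParams.model_fields.
PRICING_FIELDS = frozenset({"input_cost_per_token", "output_cost_per_token"})
OPT_IN_FLAG = "allow_client_pricing_override"


def strip_client_pricing(
    data: Dict[str, Any],
    key_metadata: Dict[str, Any],
    team_metadata: Dict[str, Any],
) -> List[str]:
    """Drop client-supplied pricing fields in place; return the dropped names."""
    if key_metadata.get(OPT_IN_FLAG) or team_metadata.get(OPT_IN_FLAG):
        return []  # operator opted in for this key or team
    dropped = [name for name in sorted(PRICING_FIELDS) if name in data]
    for name in dropped:
        del data[name]
    metadata = data.get("metadata")
    if isinstance(metadata, dict) and "model_info" in metadata:
        del metadata["model_info"]
        dropped.append("metadata.model_info")
    return dropped


request = {
    "model": "example-model",
    "input_cost_per_token": 0.0,
    "metadata": {"model_info": {"input_cost_per_token": 0.0}, "trace": "abc"},
}
dropped_fields = strip_client_pricing(request, key_metadata={}, team_metadata={})
```

Returning the dropped names keeps the function easy to test and gives the caller something to log.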

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(proxy): log stripped pricing fields at debug for operator visibility

Operators upgrading would otherwise see client-supplied pricing overrides
silently stop applying with no diagnostic. Emit a debug-level line listing
the dropped fields and pointing at the opt-in flag when any are stripped;
stay silent on the no-op path so the log isn't filled with noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(proxy): move pricing strip below the litellm_metadata JSON-string parse

The strip ran before the proxy parses ``litellm_metadata`` from a JSON
string into a dict (a path used by multipart/form-data and ``extra_body``
callers), so ``isinstance(metadata, dict)`` was False and ``model_info``
survived the strip. Move the call to the same post-parse position the
``user_api_key_*`` strip already uses for the same reason. Adds a
regression test exercising the JSON-string ``litellm_metadata`` path.
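
Why the ordering matters can be shown with a small sketch. ``parse_metadata`` and ``strip_model_info`` are hypothetical helpers standing in for the proxy's parse and strip steps.

```python
import json
from typing import Any, Dict


def parse_metadata(data: Dict[str, Any]) -> None:
    """Turn a JSON-string litellm_metadata into a dict, in place."""
    raw = data.get("litellm_metadata")
    if isinstance(raw, str):
        data["litellm_metadata"] = json.loads(raw)


def strip_model_info(data: Dict[str, Any]) -> None:
    """Remove model_info, but only when the metadata is already a dict."""
    metadata = data.get("litellm_metadata")
    if isinstance(metadata, dict):
        metadata.pop("model_info", None)


raw_body = '{"model_info": {"input_cost_per_token": 0}, "tag": "a"}'

# Strip before parse: the metadata is still a str, so nothing is removed.
wrong_order = {"litellm_metadata": raw_body}
strip_model_info(wrong_order)
parse_metadata(wrong_order)

# Parse before strip: the metadata is a dict by the time the strip runs.
right_order = {"litellm_metadata": raw_body}
parse_metadata(right_order)
strip_model_info(right_order)
```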

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(responses): replace legacy claude-4-sonnet alias in multiturn tool-call test

Anthropic's main API no longer resolves the non-canonical 'claude-4-sonnet-20250514'
alias for freshly issued keys, returning 404 not_found_error. PR #27031 already
moved three other live tests pinned to this alias over to claude-haiku-4-5-20251001
but missed test_multiturn_tool_calls in the responses API suite, which is now
failing reliably on PR CI runs (e.g. PR #27074, job 1603363).

Bump the two model references in test_multiturn_tool_calls to the same
claude-haiku-4-5-20251001 snapshot used by PR #27031 -- it covers everything
this test exercises (tool calling, multi-turn) and isn't on a deprecation
schedule.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* chore(proxy): close callback-config and observability-credential side channels

Two related gaps in the proxy's request bouncer:

1. ``is_request_body_safe`` (auth_utils.py) walked the request-body root
   and the ``litellm_embedding_config`` nested dict, but not ``metadata``
   or ``litellm_metadata``. The same fields it bans at root — Langfuse /
   Langsmith / Arize / PostHog / Braintrust / Phoenix / W&B Weave / GCS /
   Humanloop / Lunary credentials and routing — were silently accepted
   when the caller put them inside metadata, retargeting observability
   callbacks to a caller-controlled host with caller-supplied creds.
   Walk both metadata containers (and parse the JSON-string form sent via
   multipart / ``extra_body``) through the same banned-params helper, so
   the existing ``allow_client_side_credentials`` opt-in covers both
   paths consistently.

2. The banned-params list was hand-maintained and lagged the canonical
   ``_supported_callback_params`` allow-list in
   ``initialize_dynamic_callback_params``. Derive the observability bans
   from that allow-list (minus a small ``_SAFE_CLIENT_CALLBACK_PARAMS``
   set for informational fields like ``langfuse_prompt_version`` and
   ``langsmith_sampling_rate``) so future integrations are covered
   automatically; ``_EXTRA_BANNED_OBSERVABILITY_PARAMS`` carries the
   handful of fields integrations read but the allow-list hasn't caught
   up to. A guard test fails CI if a new entry is added to
   ``_supported_callback_params`` without an explicit safe-list decision.
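
Deriving the ban list from the allow-list, as in point 2, reduces to set arithmetic. The sketch below uses illustrative stand-in contents; only ``langfuse_host``, ``langfuse_prompt_version`` and ``langsmith_sampling_rate`` are named in this message, and ``example_unlisted_credential`` is invented for the example.

```python
# Illustrative stand-ins for the real lists in the proxy.
SUPPORTED_CALLBACK_PARAMS = (
    "langfuse_public_key",
    "langfuse_secret",
    "langfuse_host",
    "langfuse_prompt_version",
    "langsmith_api_key",
    "langsmith_sampling_rate",
)
SAFE_CLIENT_CALLBACK_PARAMS = frozenset(
    {"langfuse_prompt_version", "langsmith_sampling_rate"}
)
# Hypothetical field that an integration reads but the allow-list lacks.
EXTRA_BANNED_OBSERVABILITY_PARAMS = frozenset({"example_unlisted_credential"})

# Everything the callbacks support is banned from clients unless explicitly
# marked safe; the extras cover fields the allow-list has not caught up to.
BANNED_OBSERVABILITY_PARAMS = (
    frozenset(SUPPORTED_CALLBACK_PARAMS) - SAFE_CLIENT_CALLBACK_PARAMS
) | EXTRA_BANNED_OBSERVABILITY_PARAMS
```

With this shape, a new entry in the supported list is banned by default until someone adds it to the safe set.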

Separately in ``litellm_pre_call_utils.py``: add ``callbacks``,
``service_callback``, ``logger_fn``, and ``litellm_disabled_callbacks``
to ``_UNTRUSTED_ROOT_CONTROL_FIELDS``. The first three are appended to
worker-wide ``litellm.{input,success,failure,_async_*,service}_callback``
lists / ``litellm.user_logger_fn`` from inside ``function_setup`` — one
request poisons every subsequent caller in that worker. The last is the
inverse primitive: the legitimate path reads it from key/team metadata,
whereas the request-body version silently disables admin-configured audit /
observability for the call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(auth): per-param allow must continue, not return early

A pre-existing logic bug in ``_check_banned_params``: when the
deployment-level ``configurable_clientside_auth_params`` permitted one
banned field, the loop ``return``-ed on the first match instead of
``continue``-ing, so any other banned param later in the same body or
metadata dict was never checked. This PR's metadata walk multiplies the
surface where that bypass matters — a body pairing an allowed
``api_base`` with an observability credential like ``langfuse_host``
would silently pass.

Proxy-wide ``allow_client_side_credentials`` keeps ``return`` (it's a
global opt-in for every banned param). The per-param branch becomes
``continue`` so only the one explicitly-permitted field is skipped.

Adds a regression test that exercises the api_base + langfuse_host pair.
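
The ``return`` versus ``continue`` distinction can be sketched as follows. ``check_banned_params`` and the two-entry ``BANNED`` tuple are simplified, hypothetical stand-ins for ``_check_banned_params`` and its real list.

```python
from typing import Any, Dict, Set

BANNED = ("api_base", "langfuse_host")  # illustrative subset


def check_banned_params(
    body: Dict[str, Any], allow_all: bool, allowed_params: Set[str]
) -> None:
    """Raise ValueError if the body carries a banned param that is not allowed."""
    if allow_all:
        return  # global opt-in covers every banned param
    for name in BANNED:
        if name not in body:
            continue
        if name in allowed_params:
            # Skip only this field. A `return` here would end the check and
            # let any later banned param through unexamined.
            continue
        raise ValueError(f"{name} is not allowed in the request body")


def is_rejected(body: Dict[str, Any], allowed_params: Set[str]) -> bool:
    try:
        check_banned_params(body, allow_all=False, allowed_params=allowed_params)
    except ValueError:
        return True
    return False
```

Here ``is_rejected({"api_base": "x", "langfuse_host": "y"}, {"api_base"})`` is true, which is the pairing the regression test targets.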

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(vector_store): resolve embedding config at request time, never persist creds

The vector store create/update path previously called
``_resolve_embedding_config`` against the admin-configured router/DB
model and persisted the resolved ``litellm_embedding_config`` dict
(``api_key`` / ``api_base`` / ``api_version``) into the
``litellm_managedvectorstorestable.litellm_params`` column. Because the
resolver expanded ``os.environ/...`` references via ``get_secret``, the
DB row carried cleartext provider credentials, and the
``/vector_store/{new,info,update,list}`` responses returned them to any
authenticated caller who could supply a known admin model name.

Move the auto-resolve out of ``create_vector_store_in_db`` and out of
the update path. Persist only the user-supplied ``litellm_embedding_model``
reference. Resolve at request-handling time inside
``_update_request_data_with_litellm_managed_vector_store_registry`` so
the resolved config lives in the per-request ``data`` dict and is
garbage-collected after the response. Legacy rows that were created by
an earlier proxy version and already carry a resolved
``litellm_embedding_config`` skip the re-resolution and pass through
unchanged so embedding calls keep working.

The ``new_vector_store`` response now also runs the existing
``_redact_sensitive_litellm_params`` masker (already used by ``info``,
``update``, and ``list``), defending against caller-supplied cleartext
on the create path and against legacy rows whose persisted credentials
are still in the database.

Existing tests that asserted the old write-time-resolve behaviour are
updated to assert the new persistence shape (no embedding config
stored, just the model reference). Two new tests cover the use-time
path: one asserting fresh resolution happens when a row carries only
the model reference, the other asserting legacy rows with persisted
config skip re-resolution and continue to work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(vector_store): tighten registry-mutation comment and dedupe test helpers

* fix(vector_store): cache use-time embedding-config resolution

Hold the resolved config in a process-memory TTL cache so the
request-handling path doesn't run litellm_proxymodeltable.find_first
on every vector-store call.
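
A process-memory TTL cache of this kind could be as small as the sketch below. ``TTLCache`` is a hypothetical class written for illustration; the cache type, key, and TTL actually used by the proxy are not stated here.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_resolve(self, key: str, resolver: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive lookup
        value = resolver()  # e.g. the database query for the model row
        self._entries[key] = (now + self._ttl, value)
        return value
```

Because the resolved config lives only in memory and expires, the database still never stores the credentials.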

* fix(anthropic,bedrock,vertex): forward output_config.effort + 400 on garbage reasoning_effort

Follow-up bugs surfaced by the QA sweep on PR #27039
(https://github.com/BerriAI/litellm/pull/27039#issuecomment-4363363610).

1. Stop stripping output_config.effort on Bedrock + Vertex adaptive routes.
   - Vertex AI Claude 4.6/4.7 accepts output_config.effort on rawPredict
     (verified end-to-end against us-east5 / global). The strip helper now
     no-ops for effort.
   - Bedrock Converse routes output_config into additionalModelRequestFields
     for anthropic base models so the requested adaptive tier (low/medium/
     high/xhigh/max) actually reaches the wire instead of all collapsing to
     identical thinking.
   - Bedrock Invoke chat transformation (AmazonAnthropicClaudeConfig) stops
     popping output_config from the post-AnthropicConfig request body.
   - Bedrock Invoke /v1/messages allowlist (BedrockInvokeAnthropicMessagesRequest)
     now lists output_config so the runtime allowlist filter forwards it.

2. Validate effort across Bedrock Converse so 'disabled' / 'invalid' / '' /
   unsupported tiers (xhigh/max on Sonnet 4.6 or budget-mode 4.5 models)
   surface as a clean 400 BadRequestError instead of 500.

3. ValueError -> BadRequestError throughout (AnthropicConfig.map_openai_params,
   _apply_output_config, AmazonConverseConfig._handle_reasoning_effort_parameter).
   Empty-string effort is now rejected (was silently passing the
   'if effort and ...' short-circuit).

4. Floor reasoning_effort='minimal' at the Anthropic provider minimum
   (1024 budget_tokens) via new ANTHROPIC_MIN_THINKING_BUDGET_TOKENS so it's
   a usable tier on direct Anthropic / Azure AI Anthropic / Vertex AI Anthropic /
   Bedrock Invoke (all of which 400 below 1024).

5. model_prices: dedupe duplicate supports_max_reasoning_effort key on
   claude-opus-4-7 / claude-opus-4-7-20260416.

Adds regression tests across all five affected paths; existing tests asserting
the silent-strip behavior were updated to reflect the new pass-through and
clean 400 surfaces.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(constants): make ANTHROPIC_MIN_THINKING_BUDGET_TOKENS a plain constant

The documentation CI test (tests/documentation_tests/test_env_keys.py)
asserts every os.getenv() key in the source has a matching entry in the
litellm-docs config_settings.md table. ANTHROPIC_MIN_THINKING_BUDGET_TOKENS
tracks Anthropic's published wire-protocol minimum (1024) — it's not a
user-tunable, so making it env-overridable was wrong anyway. Drop the
os.getenv() wrapper; the value is now a plain literal.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(anthropic,bedrock): correct effort error message and dedupe effort_map

- Remove 'none' from the Bedrock _validate_anthropic_adaptive_effort error
  message; it was listed as a valid value but rejected by the membership
  check, leaving users in a feedback loop if they tried 'none'.
- Hoist the duplicated reasoning_effort -> output_config.effort mapping
  out of AnthropicConfig.map_openai_params and
  AmazonConverseConfig._handle_reasoning_effort_parameter into a single
  AnthropicConfig.REASONING_EFFORT_TO_OUTPUT_CONFIG_EFFORT class constant
  so the two routes cannot drift.

* fix(anthropic): translate reasoning_effort on /v1/messages route

Closes the remaining QA-sweep gap on PR #27074: Bedrock Invoke
/v1/messages was silently ignoring ``reasoning_effort`` because the
shared param filter only kept native Anthropic keys, so every effort
tier collapsed to the same behavior on the wire (27/231 cells failing
across opus-4-5 / opus-4-6 / sonnet-4-6).

Map ``reasoning_effort`` to native Anthropic ``thinking`` /
``output_config.effort`` at the ``AnthropicMessagesConfig`` layer so
all four /v1/messages routes (direct Anthropic, Azure AI, Vertex AI,
Bedrock Invoke) inherit the same translation:

- Add ``reasoning_effort`` to ``AnthropicMessagesRequestOptionalParams``
  so the param filter in
  ``AnthropicMessagesRequestUtils.get_requested_anthropic_messages_optional_param``
  no longer drops it before the transformation runs.

- Add ``_translate_reasoning_effort_to_anthropic`` and call it from
  ``transform_anthropic_messages_request``. Mirrors
  ``AnthropicConfig.map_openai_params`` on the chat completion path
  (re-uses ``_map_reasoning_effort`` and
  ``REASONING_EFFORT_TO_OUTPUT_CONFIG_EFFORT``) so the two routes
  cannot drift. Pops ``reasoning_effort`` so it never reaches the wire.

- Caller-supplied native ``thinking`` / ``output_config.effort`` always
  win — same precedence as
  ``_translate_legacy_thinking_for_adaptive_model``.

- Garbage values (``""``, ``"disabled"``, ``"invalid"``) raise
  ``AnthropicError(status_code=400)`` instead of falling through and
  surfacing as 500s from the provider.

- ``"none"`` clears thinking + output_config so callers can opt out
  per request.

Also restores the non-adaptive-model test coverage on Bedrock Invoke
/v1/messages that the previous commit lost when
``test_bedrock_messages_strips_output_config`` was renamed to the
``forwards`` variant on Opus 4.7.

Adds a new test file
``test_reasoning_effort_translation.py`` covering the translation at
the shared config level (adaptive + non-adaptive models, none, garbage,
caller precedence) so all four /v1/messages routes are exercised by a
single suite.

Adds parametrized + behavioral tests on the Bedrock Invoke /v1/messages
suite covering: minimal/low/medium/high/xhigh/max mapping for adaptive
models, thinking-budget mapping for non-adaptive Opus 4.5, ``none``
clears both, garbage raises 400, explicit ``output_config`` wins.

Refs: https://github.com/BerriAI/litellm/pull/27074

* fix(anthropic,bedrock): reject unmapped reasoning_effort at mapping site

Both the chat completion path (AnthropicConfig.map_openai_params) and the
Bedrock Converse path (_handle_reasoning_effort_parameter) used
REASONING_EFFORT_TO_OUTPUT_CONFIG_EFFORT.get(value, value) which falls
back to the raw input on unmapped keys. Combined with _map_reasoning_effort
returning type='adaptive' for any string on Claude 4.6/4.7, garbage values
(e.g. 'disabled') could leak into optional_params['output_config']['effort']
unvalidated if map_openai_params ran without the downstream transform_request
or _validate_anthropic_adaptive_effort check.

Mirror the /v1/messages pattern: use .get(value) (no fallback) and raise
BadRequestError immediately when the value is unmapped, co-locating
validation with the mapping for defense in depth.
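
The difference between ``.get(value, value)`` and ``.get(value)`` is sketched below. ``EFFORT_MAP`` holds hypothetical entries and ``BadRequestSketch`` is a stand-in exception, not LiteLLM's ``BadRequestError``.

```python
from typing import Dict

# Hypothetical entries; the real mapping lives in LiteLLM.
EFFORT_MAP: Dict[str, str] = {"low": "low", "medium": "medium", "high": "high"}


class BadRequestSketch(ValueError):
    """Stand-in for a 400-style error."""

    status_code = 400


def lenient_lookup(value: str) -> str:
    # Old shape: unmapped input falls back to itself and leaks through.
    return EFFORT_MAP.get(value, value)


def strict_lookup(value: str) -> str:
    # New shape: no fallback, so unmapped input is rejected at the mapping site.
    mapped = EFFORT_MAP.get(value)
    if mapped is None:
        raise BadRequestSketch(f"Invalid reasoning_effort: {value!r}")
    return mapped
```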

* style: black formatting

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(anthropic): stop class-attr leak; gate xhigh/max on every route

The reasoning-effort mapping dict was a public class attribute on
AnthropicConfig, so BaseConfig.get_config returned it as a request
parameter and every Anthropic-backed call (Anthropic / Azure / Vertex /
Bedrock Invoke) hit a 400 'REASONING_EFFORT_TO_OUTPUT_CONFIG_EFFORT:
Extra inputs are not permitted' from the provider. Move the mapping
to a module-level constant.

_supports_effort_level only looked the model up under
custom_llm_provider='anthropic', so bedrock-prefixed model ids
(e.g. bedrock/invoke/us.anthropic.claude-opus-4-7) returned False
for both 'max' and 'xhigh' even when the underlying model entry has
the flag set. Strip known provider prefixes and retry the lookup
against litellm.model_cost directly so per-model gating works on
every route.
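
The prefix-stripping retry might be sketched as below. The prefix list, the flag table, and the helper names are all assumptions made for illustration; only the example id ``bedrock/invoke/us.anthropic.claude-opus-4-7`` and the flag name come from this message.

```python
from typing import Dict

# Hypothetical flag table standing in for litellm.model_cost.
MODEL_FLAGS: Dict[str, Dict[str, bool]] = {
    "claude-opus-4-7": {"supports_max_reasoning_effort": True},
}
# Assumed prefixes, for illustration only.
KNOWN_PREFIXES = ("bedrock/invoke/", "bedrock/converse/", "bedrock/", "us.", "anthropic.")


def strip_provider_prefixes(model: str) -> str:
    """Repeatedly remove known leading prefixes until none match."""
    changed = True
    while changed:
        changed = False
        for prefix in KNOWN_PREFIXES:
            if model.startswith(prefix):
                model = model[len(prefix):]
                changed = True
    return model


def supports_effort_level(model: str, level: str) -> bool:
    flag = f"supports_{level}_reasoning_effort"
    # Try the id as given first, then retry with provider prefixes removed.
    for candidate in (model, strip_provider_prefixes(model)):
        if MODEL_FLAGS.get(candidate, {}).get(flag):
            return True
    return False
```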

Mirror the per-model xhigh/max gate from
AnthropicConfig._apply_output_config in
AnthropicMessagesConfig._translate_reasoning_effort_to_anthropic so
the /v1/messages route also raises a clean 400 instead of forwarding
the unsupported tier.

* feat(anthropic,bedrock): strip output_config under drop_params for non-effort models

When a proxy fronts Claude Code (which always sends `output_config.effort`)
at a pre-4.5 Anthropic model — haiku-3, sonnet-3.5, opus-3, sonnet-4 — the
forwarded knob causes a forced 400 the client can't fix. Gating a strip
behind the existing `drop_params` flag lets operators opt into silent
fixup once and stop worrying about per-model param hygiene.

Default (`drop_params=False`) still forwards and surfaces the provider's
error, preserving the strict, debuggable contract from #27074.

Per https://platform.claude.com/docs/en/build-with-claude/effort the
supporting set is Opus 4.5+, Sonnet 4.6+, and Mythos Preview; everything
else is dropped (with a verbose_logger warning so the strip is visible).
Recognition uses model-name patterns plus a fallback to any
`supports_*_reasoning_effort` flag in the model map for forward
compatibility with new entries.

https://claude.ai/code/session_01WjHq31rvXT6xYNdVmSJvRp

(cherry picked from commit 1233943e7861ba8a9062f792310ebd401cb03db8)

* fix(base_llm): filter all _-prefixed class attrs from get_config

The drop_params strip work added `AnthropicConfig._EFFORT_SUPPORTING_MODEL_PATTERNS`
as a private class-level lookup tuple. `BaseConfig.get_config()` only
filtered the `__`-prefixed names plus `_abc` / `_is_base_class`, so
`_EFFORT_SUPPORTING_MODEL_PATTERNS` would have leaked into the request
body the same way `REASONING_EFFORT_TO_OUTPUT_CONFIG_EFFORT` did before
the previous commit.

Generalize the existing `_abc` / `_is_base_class` carve-outs to skip
every `_`-prefixed name. `AmazonConverseConfig.get_config()` overrides
the base method, so apply the same change there.

Also prevents future internal helpers from being accidentally serialised
into the wire body.
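
The generalized filter can be illustrated with a simplified ``get_config``. ``BaseConfigSketch`` and ``ExampleConfig`` are hypothetical classes; the real ``BaseConfig.get_config()`` applies further checks not shown here.

```python
from typing import Any, Dict


class BaseConfigSketch:
    @classmethod
    def get_config(cls) -> Dict[str, Any]:
        return {
            name: value
            for name, value in vars(cls).items()
            # One rule covers __dunder__ names, _abc, _is_base_class, and any
            # private lookup table added later.
            if not name.startswith("_")
            and not callable(value)
            and not isinstance(value, (classmethod, staticmethod))
            and value is not None
        }


class ExampleConfig(BaseConfigSketch):
    max_tokens = 256
    _LOOKUP_PATTERNS = ("pattern-a", "pattern-b")  # private helper data


config = ExampleConfig.get_config()
```

Only ``max_tokens`` survives, so the private tuple never reaches the request body.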

* fix(anthropic): drive output_config.effort support from model map flags

Replace hardcoded _EFFORT_SUPPORTING_MODEL_PATTERNS with a JSON-backed
check that uses supports_*_reasoning_effort flags from the model map.
Add supports_minimal_reasoning_effort: true to opus-4-5 and mythos-preview
entries (which previously only carried supports_reasoning) so the JSON
remains the single source of truth for effort capability.

* fix(anthropic,bedrock,databricks): four reasoning_effort follow-ups

- claude-sonnet-4-6 + reasoning_effort=max no longer 400s. Renamed
  _is_opus_4_6_model to _is_claude_4_6_model at three sites and added
  supports_max_reasoning_effort: true to 12 model entries in the JSON
  cost map (10 sonnet 4.6 ids + OpenRouter opus 4.6/4.7).
- _map_reasoning_effort now raises BadRequestError(400) directly with
  llm_provider, instead of letting Databricks (and similar callers)
  surface its raw ValueError as a 500.
- output_config.effort on Opus 4.5 over Bedrock no longer 400s for
  missing effort-2025-11-24 beta. Flipped JSON to "effort-2025-11-24"
  for bedrock + bedrock_converse and added an auto-attach branch in
  _process_tools_and_beta for non-adaptive Anthropic + output_config
  on Converse.
- reasoning_effort=xhigh / =max on legacy budget-mode models
  (Haiku 4.5, Sonnet 4.5, Opus 4.5) now map to thinking.budget_tokens
  8192 / 16384 instead of returning 400. Added two constants in
  litellm/constants.py.

Tests updated for all four flips. Validated end-to-end via 306-cell
live proxy matrix (6 model families x 3 routes x 17 effort cases),
all pass.

* fix(databricks): validate reasoning_effort and set output_config on adaptive Claude

The Databricks path called `AnthropicConfig._map_reasoning_effort` for
Claude models but never validated the effort string nor set
`output_config.effort` for adaptive models (Claude 4.6/4.7). Since
`_map_reasoning_effort` returns `type=adaptive` for ANY non-None /
non-"none" string on adaptive models (including "disabled",
"invalid", ""), Databricks silently accepted garbage and emitted a
request without an `output_config.effort`, collapsing every adaptive
tier to identical behavior.

Match the Anthropic native, Bedrock Converse, Bedrock Invoke, and
/v1/messages paths: when the resolved `thinking` is non-None on a
4.6/4.7 model, look up the value in
`REASONING_EFFORT_TO_OUTPUT_CONFIG_EFFORT` and either raise a clean
`BadRequestError` or set `optional_params["output_config"]`.

* fix(azure): omit model from image generation and image edit deployment requests

Azure OpenAI routes image gen/edit by deployment in the URL; sending the
deployment id in model breaks gpt-image-2 (invalid_value). Strip model from
JSON for deployments/.../images/generations and from multipart data for
.../images/edits. Non-deployment URLs (e.g. Azure AI FLUX) unchanged.

Fixes #26316.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(azure): exercise image gen JSON filter via HTTP client; dedupe image edit URL

- Image generation tests patch HTTPHandler.post / get_async_httpx_client so
  make_*_azure_httpx_request runs and wire json is asserted on call kwargs.
- Azure image edit: strip model in finalize_image_edit_multipart_data using the
  same URL string the handler passes to POST (no second get_complete_url in
  transform). BaseImageEditConfig default finalize is a no-op.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(azure_ai/anthropic): promote output_config out of extra_body so validation runs

`azure_ai` is registered in `litellm.openai_compatible_providers`, so
`add_provider_specific_params_to_optional_params` (litellm/utils.py)
auto-stuffs any non-OpenAI kwarg (e.g. `output_config={"effort": "..."}`)
into `optional_params["extra_body"]`. `AzureAnthropicConfig.transform_request`
then strips `extra_body` entirely on the way out, silently dropping the
param — and `AnthropicConfig._apply_output_config` never sees it, so
`effort="invalid"` / `effort="xhigh"` on a non-supporting model
quietly reaches the model with default behavior instead of returning a
clean 400 (as the native `anthropic` provider does).

Promote the keys back to top-level `optional_params` (using `setdefault`
so explicit top-level values win) before delegating to the parent
`AnthropicConfig`. Apply in both `validate_environment` and
`transform_request` so flag detection (`is_mcp_server_used`, etc.) and
output-config validation both run.

Surfaced by the QA matrix expansion on PR #27074: 20 cells where Azure
returned 200 while `anthropic` returned 400 — all `output_config` mode
across haiku_4_5, sonnet_4_5, opus_4_5, sonnet_4_6, opus_4_6, opus_4_7
families with `effort` in {invalid, xhigh, max, low, medium, high}.

Tests:
* `test_output_config_promoted_from_extra_body`: valid effort reaches data
* `test_invalid_output_config_effort_raises_via_extra_body`: 400 on bad effort
* `test_unsupported_effort_xhigh_raises_via_extra_body`: 400 on xhigh-on-Sonnet-4.6
* `test_extra_body_promotion_does_not_clobber_top_level`: setdefault semantics
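
The ``setdefault`` promotion can be sketched as follows. ``promote_from_extra_body`` is a hypothetical helper name, and the key list is limited to ``output_config`` for illustration.

```python
from typing import Any, Dict, Tuple


def promote_from_extra_body(
    optional_params: Dict[str, Any], keys: Tuple[str, ...] = ("output_config",)
) -> None:
    """Copy selected keys from extra_body to the top level, in place."""
    extra_body = optional_params.get("extra_body")
    if not isinstance(extra_body, dict):
        return
    for key in keys:
        if key in extra_body:
            # setdefault keeps an explicit top-level value if one exists.
            optional_params.setdefault(key, extra_body[key])


stuffed = {"extra_body": {"output_config": {"effort": "high"}}}
promote_from_extra_body(stuffed)

explicit = {
    "output_config": {"effort": "low"},
    "extra_body": {"output_config": {"effort": "high"}},
}
promote_from_extra_body(explicit)
```

Once the key is back at the top level, the parent config's validation can see it before ``extra_body`` is stripped.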

* test(image_gen): expect no model in Azure image edit multipart (#26316)

Align test_azure_image_edit_litellm_sdk with deployment-scoped Azure edits.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(anthropic): extract _validate_effort_for_model to prevent drift

The chat completion path (`_apply_output_config`) and the /v1/messages
pass-through (`AnthropicMessagesConfig._translate_reasoning_effort_to_anthropic`)
both gate `max` / `xhigh` per model. The two sites had diverged from
near-identical copies into separately maintained blocks, creating a real
drift risk when a new model tier (e.g. Claude 4.8) lands -- a contributor
could update one site and miss the other.

Centralise the gating in `AnthropicConfig._validate_effort_for_model`,
which returns an error message string or `None`. Each call site keeps
its own provider-appropriate exception type (`BadRequestError` for the
chat path, `AnthropicError` for the /v1/messages pass-through) but the
gating decision now comes from one place. Net -11 LOC.
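
The "return a message, let the caller pick the exception" shape could look like this. The support table, the exception classes, and the function names are hypothetical stand-ins; the real helper reads per-model flags from the model map.

```python
from typing import Dict, Optional, Set

# Hypothetical support table, for illustration only.
GATED_TIERS: Dict[str, Set[str]] = {
    "claude-opus-4-7": {"max", "xhigh"},
    "claude-sonnet-4-6": {"max"},
}


class ChatBadRequestSketch(Exception):
    """Stand-in for the chat path's exception type."""


class MessagesErrorSketch(Exception):
    """Stand-in for the /v1/messages path's exception type."""


def validate_effort_for_model(model: str, effort: str) -> Optional[str]:
    """Return an error message for an unsupported gated tier, else None."""
    if effort in ("max", "xhigh") and effort not in GATED_TIERS.get(model, set()):
        return f"effort={effort!r} is not supported on {model}"
    return None


def chat_path(model: str, effort: str) -> str:
    message = validate_effort_for_model(model, effort)
    if message is not None:
        raise ChatBadRequestSketch(message)
    return effort


def messages_path(model: str, effort: str) -> str:
    message = validate_effort_for_model(model, effort)
    if message is not None:
        raise MessagesErrorSketch(message)
    return effort
```

Both call sites share one decision and differ only in the exception they raise.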

Adds a parametrised unit test exercising the helper directly across
4.5 / 4.6 / 4.7 model families and `max` / `xhigh` / lower-effort
inputs. Existing tests at both call sites continue to pass unchanged.

Addresses Greptile finding on PR #27074.

* fix(databricks): narrow reasoning_effort_value to str for mypy

`non_default_params.get("reasoning_effort")` returns `Any | None`,
but `REASONING_EFFORT_TO_OUTPUT_CONFIG_EFFORT.get()` expects `str`.
Mypy flagged this on the strict pass. Narrow with `isinstance` before
the lookup; non-strings fall through to the existing `BadRequestError`
below with a clean validation message, so behavior is unchanged.

Fixes a regression introduced by 1a10746e95 in this PR.

* feat(proxy): add health_check_reasoning_effort for model health checks

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(image_gen): align Azure image gen fixture with body omitting model

Expected JSON matches deployment-scoped Azure POST (#26316).

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(anthropic/chat): force PR-local model_cost map via autouse fixture

CI runs without LITELLM_LOCAL_MODEL_COST_MAP=True, so litellm.model_cost
is loaded from main-branch JSON (default model_cost_map_url) instead of
the PR's checked-out model_prices_and_context_window.json. Tests that
assert per-model flags added in this PR (supports_max_reasoning_effort,
supports_xhigh_reasoning_effort) therefore pass locally but fail in CI
with 'AssertionError: assert False is True' on 5 cases:

  - test_anthropic_model_supports_effort_param_recognizes_supporting_models
    [anthropic.claude-mythos-preview, bedrock/.../mythos-preview,
     claude-opus-4-5-20251101]
  - test_supports_effort_level_handles_provider_prefixes
    [bedrock/invoke/us.anthropic.claude-sonnet-4-6-max-True,
     claude-sonnet-4-6-max-True]

Add an autouse fixture at tests/test_litellm/llms/anthropic/chat/conftest.py
that monkey-patches litellm.model_cost to the PR-local JSON for every test
in this directory. The parent conftest already snapshots+restores
litellm.model_cost per-function, so the mutation is contained.

This is a scoped workaround. The proper fix is to set the env var
globally in the test workflow once the ~10 inline self-set test files
are audited; tracking that as a follow-up issue.

* [Fix] Docker: Pin Wolfi And Uv To Multi-Arch Index Digests

The previous pins resolved to single-platform amd64 manifests, so buildx
pulled the same amd64 base for both linux/amd64 and linux/arm64 targets.
The published OCI index then advertised an arm64 entry whose layers are
byte-identical to amd64 -- arm64 users got an amd64 binary.

Switch all three Dockerfiles to the multi-arch image-index digests:
  - cgr.dev/chainguard/wolfi-base   (index has linux/amd64 + linux/arm64)
  - ghcr.io/astral-sh/uv:0.11.7     (index has linux/amd64 + linux/arm64)

Resolved with `docker buildx imagetools inspect <ref>` -- that returns
the index digest. `docker pull` + `docker inspect` returns the per-host
platform digest, which is what slipped in last time.

* [Fix] Docker: Pin Uv To Multi-Arch Index Digest In Remaining Dockerfiles

Apply the same fix to the three Dockerfiles not in the release pipeline
today (alpine, dev, health_check) so they stay correct if/when they're
built for arm64 in the future.

Wolfi pins are not present in these files; the python:3.11-alpine and
python:3.13-slim digests they already use are multi-arch indexes that
include arm64/v8, so only the uv pin needed swapping.

* fix(xai): fold reasoning_tokens into completion_tokens to satisfy OpenAI invariant

xAI's chat completions API counts reasoning_tokens separately from
completion_tokens, but rolls them into total_tokens. This breaks the
OpenAI invariant total_tokens == prompt_tokens + completion_tokens
that downstream consumers (including litellm's own _usage_format_tests
in tests/llm_translation/base_llm_unit_tests.py:58) rely on.

Live capture (grok-3-mini-beta, 2026-05-04):
    prompt=14, completion=10, total=336, reasoning=312
    14 + 10 = 24, NOT 336.

OpenAI's o1/o3 reasoning models include reasoning_tokens in
completion_tokens, leaving the prompt+completion=total invariant
intact. xAI deviates. This patch aligns xAI to OpenAI semantics by
folding reasoning_tokens into completion_tokens after the parent
OpenAI parser runs.

The fold is idempotent and defensive:
- Only fires when total_tokens == prompt_tokens + completion_tokens
  + reasoning_tokens (the documented xAI shape). Refuses to fold if
  the gap doesn't match, guarding against silent corruption when xAI
  changes accounting.
- Skips if completion_tokens already covers the gap (already
  normalised — e.g. cost calc replays a previously-folded Usage).
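
The two guards can be sketched with a simplified usage record. ``UsageSketch`` and ``fold_reasoning_tokens`` are hypothetical names; the numbers reuse the live capture quoted above.

```python
from dataclasses import dataclass


@dataclass
class UsageSketch:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    reasoning_tokens: int = 0


def fold_reasoning_tokens(usage: UsageSketch) -> UsageSketch:
    """Fold reasoning into completion only when the gap is fully explained."""
    if usage.reasoning_tokens <= 0:
        return usage  # nothing to fold
    visible = usage.prompt_tokens + usage.completion_tokens
    if usage.total_tokens == visible:
        return usage  # already normalised, so a second call is a no-op
    if usage.total_tokens != visible + usage.reasoning_tokens:
        return usage  # unexplained gap: refuse to guess
    usage.completion_tokens += usage.reasoning_tokens
    return usage


raw = UsageSketch(
    prompt_tokens=14, completion_tokens=10, total_tokens=336, reasoning_tokens=312
)
folded = fold_reasoning_tokens(raw)
```

After the fold, ``completion_tokens`` is 322 and 14 + 322 equals the reported total of 336.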

xai.cost_calculator.cost_per_token already added reasoning_tokens to
the visible completion count for billing. Post-fold, completion_tokens
already includes reasoning_tokens, so the unchanged cost calc would
double-bill. cost_per_token now detects the OpenAI-normalised shape
(total == prompt + completion) and skips the reasoning add-on in that
case, falling through to the legacy raw-shape behaviour for callers
that bypass the transformation (e.g. proxy log replay).

Tests:
- Adds TestXAIReasoningTokenFolding covering: gap-explained-fold,
  idempotent-no-double-fold, no-reasoning-skip, gap-mismatch-skip.
- Adds test_already_normalised_usage_does_not_double_count_reasoning
  to lock the cost-calc idempotency.
- Updates 7 pre-existing cost-calc tests whose total_tokens was
  internally inconsistent (used the OpenAI-normalised total but kept
  reasoning_tokens external) to use the documented xAI raw shape
  total = prompt + visible completion + reasoning. Pre-existing
  values masked the missing-fold by accident.

Verified end-to-end against the live xAI API:
    LITELLM_LOCAL_MODEL_COST_MAP=False (CI default) +
    XAI_API_KEY set +
    pytest tests/llm_translation/test_xai.py::TestXAIChat::test_prompt_caching
        -> PASSED in 18.81s (was: AssertionError on
        usage.total_tokens == usage.prompt_tokens + usage.completion_tokens)

20/20 tests in tests/test_litellm/llms/xai/test_xai_cost_calculator.py
and 8/8 in tests/test_litellm/llms/xai/test_xai_chat_transformation.py
pass.

* refactor(bedrock/converse): delegate effort gating to AnthropicConfig._validate_effort_for_model

Removes the duplicated max/xhigh gating logic in
_validate_anthropic_adaptive_effort and the now-unused
_supports_effort_level_on_bedrock helper. Per-model gating now flows
through the centralized AnthropicConfig._validate_effort_for_model
(whose _supports_effort_level already strips Bedrock prefixes), so the
chat completion, /v1/messages, and Bedrock Converse paths can't drift
when a new gated effort tier is added.

* Implement normalize_nonempty_secret_str function to trim whitespace from secrets and treat empty values as unset. Update proxy_server to use this function for Grafana credentials. Enhance tests to validate the new normalization behavior.

* Fix qdrant semantic cache miss metadata

* chore(deps): refresh dependency locks

* chore(deps): authorize pytest license

* fix: preserve tokenizer decode round trips

* refactor(anthropic): drive adaptive-thinking gate via supports_adaptive_thinking flag

Three of greptile's open comments on #27074 (P2 converse:512, P1
databricks:361, and the underlying capability-flag policy rule) flagged
the same pattern: _is_claude_4_6_model(...) or _is_claude_4_7_model(...)
used inline as a runtime 'is this an adaptive-thinking model?' check.
That requires a code release each time a new adaptive Claude lands.

Consolidate the inline gating to AnthropicModelInfo._is_adaptive_thinking_model,
and switch the helper itself to read a new supports_adaptive_thinking
flag from `model_prices_and_context_window.json` via `_supports_factory`,
falling back to the family pattern only when the model-map entry doesn't
carry the flag (preserves OpenRouter / Vercel / Bedrock-prefixed variants
that route through the same code path with non-canonical ids).

Adds `supports_adaptive_thinking: true` to the four 4.6/4.7 anthropic
entries (opus-4-6 + dated, opus-4-7 + dated, sonnet-4-6). Bedrock-prefixed
and Vertex-prefixed entries don't need the flag because both fall back
through the family pattern (the helper short-circuits early on True from
either path) and the bedrock/vertex Claude IDs all match the existing
opus-4-{6,7} / sonnet-4-{6,7} pattern.

Affected call sites:

- `bedrock/chat/converse_transformation.py:_handle_reasoning_effort_parameter`
- `anthropic/chat/transformation.py:_map_reasoning_effort`
- `anthropic/chat/transformation.py:map_openai_params` (output_config branch)
- `databricks/chat/transformation.py:map_openai_params` (output_config branch)

The remaining `_is_claude_4_6_model` / `_is_claude_4_7_model` references
in `AnthropicConfig._validate_effort_for_model` and
`AnthropicConfig.get_supported_openai_params` are intentionally retained:
they're per-model gating fallbacks for variants whose model-map entries
don't yet carry the `supports_max_reasoning_effort` /
`supports_reasoning` flag. Those are documented in-place.
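
A rough sketch of the flag-first, pattern-fallback lookup described above. The `MODEL_MAP` dict, the regex, and the function name are illustrative assumptions; the real helper reads the flag through `_supports_factory`, whose signature is not shown here.

```python
import re
from typing import Dict

# Hypothetical stand-in for model_prices_and_context_window.json.
MODEL_MAP: Dict[str, Dict[str, bool]] = {
    "claude-opus-4-7": {"supports_adaptive_thinking": True},
}

# Assumed family pattern: opus-4-{6,7} / sonnet-4-{6,7}, not followed by a digit.
_FAMILY_PATTERN = re.compile(r"(opus|sonnet)-4-[67](?!\d)")


def is_adaptive_thinking_model(model: str) -> bool:
    """Return True if either the model-map flag or the family pattern says so."""
    if MODEL_MAP.get(model, {}).get("supports_adaptive_thinking") is True:
        return True  # short-circuit on the capability flag
    # Fallback for non-canonical ids (prefixed variants) that carry no flag.
    return _FAMILY_PATTERN.search(model.lower()) is not None
```

With this shape, a new adaptive model only needs a model-map entry, while prefixed ids that never appear in the map keep working through the pattern.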

Tests: 537 anthropic/bedrock/databricks/vertex/messages tests pass.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* chore(deps): address dependency review notes

* test(model_prices): add supports_adaptive_thinking to schema

`test_aaamodel_prices_and_context_window_json_is_valid` validates the
model-map JSON against an explicit schema with `additionalProperties`,
so the new `supports_adaptive_thinking` flag added in
98ced0ae43 needs a matching schema entry.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* refactor: remove unnecessary comments from #27074

Strip out the explanatory and historical comments that don't carry
business-logic justification. Comments that simply narrate what code
does — or that explain prior behavior, what was changed, or which PR
introduced a fix — are removed. Docstrings are reduced to a one-line
summary where the long form repeated information already evident from
the code or test data.

No code-behavior changes. All 643 affected unit tests still pass.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test: keep decode token test local

* chore(deps): align dashboard node engine

* feat: selectively apply routing strategy according to model name

* style: make _model_supports_effort_param more concise

* refactor(anthropic,bedrock): hoist drop_params output_config warning to module constant

Three call sites (anthropic chat, bedrock converse, bedrock invoke messages)
emitted the same '...Effort is only supported on Opus 4.5+, Sonnet 4.6+, and
Mythos Preview' warning verbatim. Extract DROP_UNSUPPORTED_OUTPUT_CONFIG_WARNING
in litellm/llms/anthropic/chat/transformation.py and import it from the bedrock
sites so future copy edits live in one place.

Addresses Michael's review on PR #27074.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* refactor(anthropic,bedrock,databricks): factor BadRequestError for unknown reasoning_effort

Three call sites raised the same BadRequestError("Invalid reasoning_effort:
... Must be one of 'minimal', 'low', ...") block when REASONING_EFFORT_TO_OUTPUT_CONFIG_EFFORT
returned None: anthropic chat map_openai_params, bedrock converse
_handle_reasoning_effort_parameter, and databricks chat reasoning_effort path.

Extract AnthropicConfig._raise_invalid_reasoning_effort(model, value, llm_provider)
so future copy edits / valid-set changes happen in one place. Typed as NoReturn
so type-checkers correctly narrow control flow at call sites.
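
To illustrate why the `NoReturn` annotation matters, here is a small standalone sketch. The exception class, the effort table, and the function names are placeholders, not the LiteLLM definitions.

```python
from typing import Dict, NoReturn, Optional

# Illustrative subset only; the real mapping table is not reproduced here.
EFFORT_MAP: Dict[str, str] = {"low": "low", "medium": "medium", "high": "high"}


class InvalidReasoningEffort(ValueError):
    """Placeholder for the BadRequestError raised in the real code."""


def raise_invalid_reasoning_effort(
    model: str, value: object, llm_provider: str
) -> NoReturn:
    # NoReturn tells type-checkers this call never returns normally.
    raise InvalidReasoningEffort(
        f"Invalid reasoning_effort: {value!r} for {llm_provider}/{model}. "
        f"Must be one of {sorted(EFFORT_MAP)}"
    )


def map_effort(model: str, value: str, llm_provider: str) -> str:
    mapped: Optional[str] = EFFORT_MAP.get(value)
    if mapped is None:
        raise_invalid_reasoning_effort(model, value, llm_provider)
    # After the NoReturn call, `mapped` is narrowed from Optional[str] to str.
    return mapped
```

Without `NoReturn`, a checker such as mypy would still treat `mapped` as `Optional[str]` on the final line and flag the return.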

Addresses Michael's review on PR #27074.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* Clean up Redis semantic cache isolation fallback

* fix(guardrails): align banned_keywords + azure_content_safety call_type gates with runtime route_type

The hooks gated on ``call_type == "completion"`` but the proxy ingress
passes ``route_type`` straight through as ``call_type`` —
``"acompletion"`` for /v1/chat/completions and ``"aresponses"`` for
/v1/responses. Tests passed because they used the literal sync
``"completion"`` value, masking the gap.

Switch both hooks to ``is_text_content_call_type`` (matches the
canonical runtime values: completion / acompletion / aresponses) and
update existing tests to assert against runtime values, plus parametrize
a regression test that pins the gate.
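
The gap can be seen in a few lines. The sketch below assumes `is_text_content_call_type` is a simple membership check over the three runtime values named above; the real signature may differ.

```python
# Runtime call_type values named in the commit message above.
TEXT_CONTENT_CALL_TYPES = frozenset({"completion", "acompletion", "aresponses"})


def old_gate(call_type: str) -> bool:
    """The previous check: only the sync literal passed."""
    return call_type == "completion"


def is_text_content_call_type(call_type: str) -> bool:
    """Assumed shape of the new gate: membership in the canonical set."""
    return call_type in TEXT_CONTENT_CALL_TYPES
```

The proxy passes `"acompletion"` for /v1/chat/completions, so `old_gate` silently skipped the hook on real traffic while tests using `"completion"` kept passing.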

* fix: remove unused import

* Add semantic cache legacy migration flag

* Treat 0 team_member_budget as no cap

* chore(caching): annotate qdrant quantization_params dict type

Mypy infers the dict's value type from the first branch
(Dict[str, bool]) which clashes with the scalar branch's mixed-type
inner dict. Explicit Dict[str, Any] annotation lifts the inference.

* chore(caching): remove allow_legacy_unscoped_cache_hits opt-in

The flag was an opt-in escape hatch for the cross-tenant leak the rest
of the patch closes — flipping it on (env var or constructor param)
re-enables exactly the VERIA-54 primitive on either backend. There is
no operational need that the secure path doesn't already meet:

- Qdrant: legacy points without ``litellm_cache_key`` payload are
  excluded by the must-clause filter and treated as misses; new sets
  populate the cache key, so cold-start lasts only as long as the
  natural cache rebuild.
- Redis: existing unscoped index can't carry the new schema; the init
  path falls back to ``{name}_isolated`` (and recreates it on stale
  schema), leaving the legacy index untouched.

Drop the construc…
Setsuna-Yukirin pushed a commit to Setsuna-Yukirin/litellm that referenced this pull request May 17, 2026
Encode the 231-cell QA sweep (21 provider x model combos x 11 effort
values) from BerriAI#27039 / BerriAI#27074 as an automated CircleCI-gated regression
suite. Each cell hits the real provider endpoint, captures the outgoing
wire body via a pre-call CustomLogger, and asserts:

- thinking.type, output_config.effort, thinking.budget_tokens, max_tokens
  in the captured request body (regression signal for silent drops/strips
  in any provider transformation)
- HTTP status (200 vs BadRequestError -> 400) returned by litellm
  (regression signal for clean-error vs leaked-500 mappings)

The matrix is encoded as a small rule set keyed by (model_mode, effort)
plus per-model xhigh/max capability overrides, then expanded across the
five chat-completion routes (Anthropic direct, Azure AI Foundry, Vertex
AI, Bedrock Converse, Bedrock Invoke /chat) and the Bedrock Invoke
/v1/messages route. Cells skip at runtime when the route's provider env
vars are absent, so PR builds without credentials no-op gracefully.

Wired into CircleCI as the reasoning_effort_grid_v4_e2e job behind the
existing main / litellm_* branch filter.
Sameerlite added a commit that referenced this pull request May 22, 2026
* test(vcr): classify cache verdicts, detect live calls, surface cost leaks

Convert the per-test VCR verdict line from a single 'NOOP / HIT / MISS /
PARTIAL' tag into a classified outcome that distinguishes the cases that
silently bill the live API on every CI run from the ones that don't:

  HIT                         pure replay
  PARTIAL                     mixed replay + new recordings
  MISS:RECORDED               new cassette saved to Redis (cached next run)
  MISS:OVERFLOW               cassette > MAX_EPISODES_PER_CASSETTE; persister
                              refused to save; re-bills every run
  MISS:NOT_PERSISTED          test failed; save_cassette skipped; re-bills
  NOOP                        VCR-marked but no HTTP traffic (mocked elsewhere)
  UNMARKED:LIVE_CALL          test bypassed VCR AND opened a TCP connection
                              to a known LLM provider host -> wasted spend
  UNMARKED:NO_TRAFFIC         test bypassed VCR but didn't call out

The UNMARKED:LIVE_CALL signal is what converts 'this test probably hits
live' into 'this test connected to api.openai.com'. We install a
socket.connect / socket.create_connection wrapper for the duration of
each non-VCR-marked test and record any outbound TCP to a known LLM
provider hostname. The probe sits below the httpx layer so vcrpy and
respx (which both patch above the socket) are unaffected.

Replace the file-level _RESPX_CONFLICTING_FILES blacklists in the
llm_translation and local_testing conftests with per-item respx
detection in apply_vcr_auto_marker_to_items. A test now skips VCR when
it actually carries @pytest.mark.respx or has respx_mock in its fixture
chain - not just because some other test in the same file imports
MockRouter. Items skipped by skip_files are split into respx_conflict
(real conflict, the module wires up respx) vs file_opt_out (dead skip-
list entry whose module never touches respx) so the session summary
makes pruning obvious.

Stabilize the AWS SigV4 fingerprint: the Authorization header on
Bedrock requests rotates its Credential date and Signature on every
call, which previously pushed every Bedrock test past the 50-episode
overflow threshold. Extract the access-key id only
('aws-sigv4:AKIA...') so two requests with the same identity match.
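
A minimal sketch of the access-key extraction, assuming the standard SigV4 `Authorization` header layout (`AWS4-HMAC-SHA256 Credential=<key id>/<date>/<region>/<service>/aws4_request, ...`). The function name and regex are illustrative, not the code from this PR.

```python
import re
from typing import Optional

# Capture only the access-key id; the date scope and signature rotate per call.
_SIGV4_CREDENTIAL = re.compile(r"^AWS4-HMAC-SHA256\s+Credential=([^/,\s]+)/")


def sigv4_fingerprint(authorization: str) -> Optional[str]:
    """Return a stable 'aws-sigv4:<access key id>' tag, or None if not SigV4."""
    match = _SIGV4_CREDENTIAL.match(authorization.strip())
    if match is None:
        return None
    return f"aws-sigv4:{match.group(1)}"
```

Two requests signed by the same identity on different days then produce the same fingerprint, so the recorded episode can match on replay.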

Always emit verdict logging when VCR is active (set
LITELLM_VCR_VERBOSE=0 to opt back into the legacy quiet mode). Add a
session-end classification summary that lists overflow tests, unmarked
live-call tests, and the skip-reason breakdown.

Wire the live-call probe + summary hook into every test directory that
already uses the Redis-backed VCR cache (audio_tests, guardrails_tests,
image_gen_tests, litellm_utils_tests, llm_responses_api_testing,
llm_translation, local_testing, logging_callback_tests, ocr_tests,
pass_through_unit_tests, router_unit_tests, search_tests,
unified_google_tests).

Add tests/llm_translation/test_vcr_classification.py covering the
verdict classifier, skip-reason tagging, AWS SigV4 fingerprint stability,
live-host classification, and session summary rendering.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): drop dead 'from respx import MockRouter' imports

These seven test files were on _RESPX_CONFLICTING_FILES, which made the
auto-marker skip them entirely. Inspecting the source shows the only
respx artifact is a top-level 'from respx import MockRouter' that no
test ever uses - no @pytest.mark.respx, no respx_mock fixture, no
respx.mock context manager. The import is dead code left over from a
previous mocking pattern.

Now that apply_vcr_auto_marker_to_items detects respx per-item via the
marker / fixture chain (b637d9f64a), the file-level skip is no longer
needed for these files - they were the reason the OpenAI tests
(test_o3_reasoning_effort, test_streaming_response[o1/o3-mini],
TestOpenAIO1::test_streaming, TestOpenAIChatCompletion::test_web_search,
TestOpenAIO3::test_web_search, etc.) ran live every CI build despite
the cassette cache being healthy.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(image_edits): regenerate fixtures per call instead of holding open module-level file handles

Module-level

    TEST_IMAGES = [
        open(os.path.join(pwd, 'ishaan_github.png'), 'rb'),
        open(os.path.join(pwd, 'litellm_site.png'), 'rb'),
    ]
    SINGLE_TEST_IMAGE = open(...)

opens the file once at import. After the first multipart upload, the
file pointer is at EOF, so every subsequent test in the same xdist
worker sends an empty multipart body. That non-determinism (a) blows
the recorded cassette past MAX_EPISODES_PER_CASSETTE (50) so
_RedisPersister.save_cassette refuses to save it, and (b) re-bills the
live image edit endpoint on every CI run.

Recent CI runs confirm the leak: tests/image_gen_tests/test_image_edits.py
shows six tests parking at 51-52 cassette entries
(TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False],
TestOpenAIImageEditDallE2::..., test_openai_image_edit_with_bytesio,
test_openai_image_edit_litellm_router, test_multiple_vs_single_image_edit[False],
test_multiple_image_edit_with_different_formats).

Replace the module-level file handles with _make_test_images() /
_make_single_test_image() factories that return fresh _RewindableImage
(BytesIO subclass) objects whose pointer always starts at 0. The image
bytes are read once at import into module-level constants
(_ISHAAN_GITHUB_BYTES, _LITELLM_SITE_BYTES), so disk I/O cost is
unchanged.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(vcr): match real Bedrock hostnames in live-call probe

The suffix '.bedrock-runtime.amazonaws.com' never matched real Bedrock
endpoints, which use the format 'bedrock-runtime[-fips].{region}.amazonaws.com'
(region between 'bedrock-runtime' and 'amazonaws.com'). Add an explicit
host check for that pattern so Bedrock live calls are visible to the
probe, and update the unit test accordingly. Also drop the unused
'_LIVE_CALL_PROBE_INSTALLED' module variable.
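
For illustration, the mismatch and one possible explicit host check look like this. The regex and function name are assumptions; only the hostname format comes from the message above.

```python
import re

# Region sits between 'bedrock-runtime[-fips]' and 'amazonaws.com'.
_BEDROCK_RUNTIME_HOST = re.compile(
    r"^bedrock-runtime(-fips)?\.[a-z0-9-]+\.amazonaws\.com$"
)


def is_bedrock_runtime_host(host: str) -> bool:
    """Match real Bedrock runtime endpoints, including FIPS variants."""
    return _BEDROCK_RUNTIME_HOST.match(host.lower()) is not None


def old_suffix_check(host: str) -> bool:
    """The suffix that never matched a real endpoint."""
    return host.endswith(".bedrock-runtime.amazonaws.com")
```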

* fix(vcr): cover full RFC1918 172.16.0.0/12 range in local prefixes
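
The commit itself works with local prefixes; as an alternative way to check the same range, the sketch below swaps in the standard-library `ipaddress` module. The function name is hypothetical.

```python
import ipaddress

# RFC 1918 block: 172.16.0.0 through 172.31.255.255.
_RFC1918_172 = ipaddress.ip_network("172.16.0.0/12")


def in_rfc1918_172_block(host: str) -> bool:
    """True if host is an IPv4 literal inside 172.16.0.0/12."""
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # hostnames and malformed literals are not local IPs
    return address.version == 4 and address in _RFC1918_172
```

The /12 covers second octets 16 through 31, which is easy to get wrong with a single string prefix such as `172.16.`.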

* fix(image_edits): drop _RewindableImage to prevent infinite multipart upload

The _RewindableImage(BytesIO) wrapper auto-rewound on every read after
EOF, which made the OpenAI SDK's multipart upload writer read the same
bytes forever instead of seeing EOF. Workers OOM'd / SIGKILL'd:

    [gw0] node down: Not properly terminated
    replacing crashed worker gw0
    ...
    worker 'gw1' crashed while running
        'tests/image_gen_tests/test_image_edits.py::TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False]'

The auto-rewind was added defensively for parametrized + flaky-retried
tests, but BaseLLMImageEditTest::test_openai_image_edit_litellm_sdk
already calls get_base_image_edit_call_args() once per invocation and
that helper now constructs fresh streams via _make_test_images(), so
rewinding inside the stream is unnecessary. Replace with plain BytesIO
seeded with the cached image bytes.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): mark Bedrock prompt-caching cross-call tests VCR-incompatible

The pass_through prompt-caching tests
(test_prompt_caching_returns_cache_read_tokens_on_second_call,
test_prompt_caching_streaming_second_call_returns_cache_read) make a
warm-up call and then assert the *second* call sees a non-zero
cache_read_input_tokens count from the upstream's prompt-cache. VCR
replay can't model cross-call provider state — both calls match the
same cassette episode, so the second call returns the first call's
pre-warmup response and the assertion fails:

    AssertionError: Expected cache_read_input_tokens > 0 on second call,
    but got 0. Full usage: {'input_tokens': 4986,
    'cache_creation_input_tokens': 4974, 'cache_read_input_tokens': 0}

This started biting after the AWS SigV4 fingerprint stabilization
(b637d9f64a): Bedrock requests now produce a stable per-access-key
fingerprint instead of a per-request signature, so cassettes
successfully replay where they previously always missed and re-recorded
live. Opt these tests out via skip_nodeid_suffixes so they run live and
match the existing pattern in tests/llm_translation/conftest.py
(::test_prompt_caching).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): tighten OVERFLOW classification and switch respx detection to AST

Address two greptile P2 review concerns on PR #27795:

1. MISS:OVERFLOW was firing whenever total > MAX_EPISODES_PER_CASSETTE
   regardless of cassette state. A cassette that grew past the cap
   historically but this run only *replayed* (dirty=False) is
   healthy — the persister never tries to save, so the cache state is
   stable and the next run will replay too. Only flag OVERFLOW when
   dirty=True (new episodes were recorded that the persister would
   refuse to save). Add a regression test covering the
   dirty=False + large-total case.

2. _module_uses_respx did substring matching on the module source,
   which false-positives on comments / docstrings / string literals.
   A comment like # Previously tried respx.mock but switched to
   vcrpy would keep a file pinned on the opt-out list, defeating the
   dead-import pruning goal of this PR. Replace the substring scan
   with an ast.NodeVisitor (_RespxUsageVisitor) that only
   counts:

     - @pytest.mark.respx / @respx.mock decorators
     - with respx.mock(): ... (sync + async) context managers
     - respx.mock(...) calls outside a with/decorator
     - function parameters / fixture names equal to respx_mock

   Add tests for the comment / docstring / string-literal cases plus
   each real-usage pattern.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(vcr): aggregate worker stats on the controller so the session summary actually renders under xdist

`_session_stats` is a module-level dict mutated inside `_vcr_outcome_gate`
— which runs in each xdist worker process. The controller's
`pytest_terminal_summary` then reads its own empty `_session_stats` and
bails on `if not counts: return`, so the OVERFLOW / LIVE_CALL sections
the rest of this PR adds never make it into CI logs in the dist mode CI
actually uses.

Ship a structured `vcr_outcome` payload via `user_properties` (which
xdist round-trips) and add `aggregate_report_outcome` on the controller
to fold worker outcomes into `_session_stats`. The recording process
tags `vcr_recorded_by` with `PYTEST_XDIST_WORKER` so the controller can
tell "single-process — already counted locally" apart from "produced by
a worker — needs aggregation here", and not double-count when there's
no xdist.

Covered by 9 new unit tests in test_vcr_classification.py including the
end-to-end summary render path.

* fix(guardrails): improve CrowdStrike AIDR input handling (#26658)

* feat(lasso): add tool-calling support to LassoGuardrail (#27648)

* feat(lasso): extend LassoGuardrail to support tool calling (RND-5748)

* fix(lasso): PR review followups for tool-calling guardrail (RND-5748)

* fix(lasso): handle object-style tool_calls in _update_tool_calls_from_masked (RND-5748)

* fix(lasso): use model role for tool_use blocks (RND-5748)

* test(lasso): add round-trip tests for message transformation (RND-5748)

* fix(lasso): remove unused imports, handle Responses-API input masking, flatten multimodal content (RND-5748)

* fix(lasso): inspect Responses-API input field (RND-5748)

* fix(lasso): guard text-cursor remap against Lasso count mismatch (RND-5748)

* fix(lasso): flatten list content in tool_result.content (RND-5748)

* fix(lasso): remap multimodal list content during masking (RND-5748)

Bug: _map_masked_messages_back counted list-content messages in
original_text_count but the remap loop only handled isinstance(str).
The positional text_cursor never advanced for list messages, causing
all subsequent masked texts to be written onto the wrong messages.

Fix: added elif isinstance(content, list) branch that replaces the
list with the masked text string and advances the cursor — mirrors
the existing string-content branch. Also handles the assistant +
tool_calls combo for list-content messages.

Test: test_map_masked_messages_back_list_content verifies a user
message with [text + image_url] followed by an assistant message
gets correct masked content on both (cursor stays aligned).

* refactor(lasso): extract _get_field and _extract_tool_call_fields helpers (RND-5748)

The dict-vs-object access pattern (x.get('y') if isinstance(x, dict)
else getattr(x, 'y', None)) was duplicated 14 times across 5 methods.

_get_field(obj, field) — single-point dict/Pydantic field access.
_extract_tool_call_fields(call) — returns (call_id, name, parsed_input)
with JSON argument parsing, replacing ~30 duplicate lines in both
async_post_call_success_hook and _expand_messages_for_classification.
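
The single-point accessor can be sketched in a few lines. This is an assumed shape based on the duplicated pattern quoted above, written without the leading underscore to mark it as illustrative.

```python
from typing import Any


def get_field(obj: Any, field: str) -> Any:
    """Read a field from either a dict or an attribute-style object."""
    if isinstance(obj, dict):
        return obj.get(field)
    return getattr(obj, field, None)
```

Callers can then treat dict-shaped and Pydantic-style tool calls uniformly, for example `get_field(get_field(call, "function"), "name")`.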

Also simplified _update_tool_calls_from_masked, _prepare_payload tool
mapping, and _apply_masking_to_model_response call_id extraction.

Net ~60 lines removed. No behavior change — all 32 tests pass.

* fix(lasso): add count guard to _apply_masking_to_model_response (RND-5748)

_apply_masking_to_model_response used a bare text_cursor without
verifying 1:1 correspondence between text-bearing choices and masked
text entries. If Lasso returned a different number of text messages
than choices with content, masked text would be applied to the wrong
choice or silently skip choices.

Added the same count-mismatch guard pattern already used in
_map_masked_messages_back: count original text-bearing choices,
compare to masked_text length, skip text remap on mismatch with a
warning log. Tool_call masking via id-based lookup is unaffected.

Tests:
- test_apply_masking_to_model_response_multiple_choices: verifies
  correct per-choice masked text with 2 choices
- test_apply_masking_to_model_response_count_mismatch: verifies
  content is left unchanged when counts disagree

* fix(lasso): close two guardrail-bypass paths flagged in review (RND-5748)

* tool-call args: when function.arguments is malformed JSON or parses
  to a non-object, preserve the raw string as {"arguments": <raw>} so
  Lasso still inspects it instead of receiving input=None. Covers both
  pre-call and post-call extraction (shared helper). Also resolves the
  CodeQL empty-except warning since the except body now assigns parsed=None.
* Responses-API input: when a request carries both "messages" and
  "input", inspect both. Previously a benign messages array let the
  guardrail skip data["input"] entirely. The masking write-back is
  split via a count boundary so masked messages flow back to
  data["messages"] and masked input flows back to data["input"]
  without cross-contamination.
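
The first fix (never hand the guardrail `input=None`) can be illustrated with a small parser. The function name is hypothetical; the `{"arguments": <raw>}` fallback shape is taken from the description above.

```python
import json
from typing import Any, Dict


def parse_tool_arguments(raw: Any) -> Dict[str, Any]:
    """Parse function.arguments, keeping the raw value when it is not a JSON object."""
    if isinstance(raw, dict):
        return raw
    parsed = None
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # Malformed JSON (ValueError) or a non-string input (TypeError).
        parsed = None
    if isinstance(parsed, dict):
        return parsed
    # Non-object or unparseable: preserve the raw value so it is still inspected.
    return {"arguments": raw}
```

The important property is that every branch returns something inspectable, so a deliberately malformed argument string cannot be used to skip classification.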

Tests: malformed/non-object args round-trip, dual-field classification,
dual-field masking write-back split.

* chore(lasso): black formatting + comment on expand skip branch (RND-5748)

* black: wrap two long expressions in lasso.py and reformat dict
  literals in test_lasso.py to satisfy CI lint.
* add a short comment in _expand_messages_for_classification
  explaining why empty string and None content are intentionally
  skipped (None is the OpenAI shape for a pure tool-call turn).

* fix(lasso): satisfy mypy in _handle_masking, _update_tool_calls_from_masked, _apply_masking_to_model_response (RND-5748)

* Narrow `response.get("messages")` into a local before slicing so
  mypy doesn't see `Optional[List[Dict[str, str]]]` as non-indexable.
* Rename the two write-side `func` bindings in
  `_update_tool_calls_from_masked` to `func_dict` / `func_obj` so
  mypy doesn't unify the dict and Any|None branches.
* Rename the inner loop variable in `_apply_masking_to_model_response`
  from `msg` to `masked_msg` to avoid clashing with the
  `msg = choice.message` rebinding below.

No behavior change; resolves the 7 mypy errors from the CI lint job.

* perf: eliminate per-request callback scanning on proxy hot path (#27858)

- Introduce `_CallbackCapabilities` dataclass and `ProxyLogging._callback_capabilities()` static method that inspects `litellm.callbacks` once and caches capability flags keyed on (list length, member ids); invalidates automatically when the callback list mutates without per-request iteration overhead
- Replace O(n) `litellm.callbacks` walks in `async_pre_call_hook`, `during_call_hook`, `async_post_call_streaming_iterator_hook`, `async_post_call_streaming_hook`, and `post_call_response_headers_hook` with fast-path exits when no relevant callbacks are registered
- Add `needs_iterator_wrap()` and `needs_per_chunk_streaming_hook()` instance methods to decouple iterator-level wrapping from per-chunk hook execution; avoids `get_response_string` materialization per chunk when no guardrail or chunk-hook callback is active
- Introduce `_fast_serialize_simple_model_response_stream()` using `orjson` for common single-choice text streaming chunks, bypassing the full Pydantic serializer; falls back to `model_dump_json` for tool calls, logprobs, usage, and provider-specific fields
- Add early-return in `_restamp_streaming_chunk_model` when downstream model already matches the requested model, avoiding unnecessary string comparisons on every chunk
- Fix stale zero-cost cache bug in `_is_model_cost_zero`: move the per-router `_zero_cost_cache` dict onto the `Router` instance and clear it in `_invalidate_model_group_info_cache` so in-place pricing updates via `upsert_deployment` immediately resume budget enforcement
- Add `scripts/benchmark_chat_completions_perf.py`: standalone async benchmarking tool with a mock OpenAI provider, LiteLLM proxy process management, non-streaming RPS, streaming TTFT, and full-stream latency measurements with repeat/median run support
- Add comprehensive unit tests covering capability detection, cache invalidation, fast-path correctness, zero-cost cache regression, and the no-callback streaming fast path

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>

* ci(mutmut): enable mutate_only_covered_lines to fit in CI budget (#27910)

The mutation-test workflow timed out at the 350-minute job cap when
running whole-folder mutation against litellm/proxy/management_endpoints/
(~30 files, ~1.5 MB of source). Every mutant was running the full
test suite, and mutants were generated for lines no test covers — which
would survive regardless, just wasting compute.

mutmut 3.x's mutate_only_covered_lines setting runs the suite once up
front to compute coverage, then skips mutating uncovered lines. This
cuts the mutant count dramatically and is the right semantic for the
score (no test → no kill possible → uncountable). Per-mutant test
filtering by function name is already automatic in mutmut 3.x; no
external coverage step is needed.

* fix(rate-limit): stop v3 limiter from leaking internal stash to provider body (#27913)

* fix(rate-limit): stop v3 limiter from leaking internal stash to provider body

PR #27001 (atomic TPM rate limit) introduced a reservation flow that
writes five LiteLLM-internal keys onto the request data dict:

  _litellm_rate_limit_descriptors
  _litellm_tpm_reserved_tokens
  _litellm_tpm_reserved_model
  _litellm_tpm_reserved_scopes
  _litellm_tpm_reservation_released

These keys are forwarded as request body params to the upstream provider,
which rejects them as unknown fields:

  OpenAI    -> 400 'Unknown parameter: _litellm_rate_limit_descriptors'
              (mapped by litellm to RateLimitError / 429, hiding the bug
               behind a misleading 'throttling_error' code)
  Anthropic -> 400 '_litellm_rate_limit_descriptors: Extra inputs are
               not permitted'

Net effect: every chat completion against any real provider fails the
moment a virtual key has any tpm_limit / rpm_limit set — i.e. v3-enforced
key-level TPM/RPM limits are broken end-to-end. The v3 RPM/TPM check
itself still runs (raises 429 on over-limit), but the success path
poisons the upstream body.

Reproduced on litellm_internal_staging HEAD (410ce761dc) against
gpt-4o-mini and claude-haiku-4-5 with a 1-RPM/1-TPM key — first request
fails with the provider's unknown-field error.

Fix: the stash is metadata only.

  - Add RATE_LIMIT_DESCRIPTORS_KEY constant and a _LITELLM_STASH_KEYS
    registry so we have a single source of truth for stash keys.
  - New helper _stash_value_in_metadata_channels writes to
    data['metadata'] / data['litellm_metadata'] without touching the
    top level.
  - _stash_reservation_in_data and the descriptor stash now route
    through that helper. _mark_reservation_released stops writing
    top-level.
  - _lookup_stashed_value also checks kwargs['metadata'] /
    kwargs['litellm_metadata'] (raw request_data shape) in addition to
    kwargs['litellm_params']['metadata'] (completion kwargs shape).
  - async_post_call_failure_hook now reads descriptors via the unified
    metadata lookup instead of request_data.get(top-level).
  - Defense in depth: async_pre_call_hook strips any stash key that
    somehow surfaced at the top level (stale cache, future refactor,
    test fixture) before returning.
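
A condensed sketch of the two moves above: stash only in the metadata channels, and strip anything that surfaces at the top level. The key names come from the list earlier in this message; the function bodies, and the choice to create `metadata` when neither channel exists, are assumptions.

```python
from typing import Any, Dict

LITELLM_STASH_KEYS = frozenset(
    {
        "_litellm_rate_limit_descriptors",
        "_litellm_tpm_reserved_tokens",
        "_litellm_tpm_reserved_model",
        "_litellm_tpm_reserved_scopes",
        "_litellm_tpm_reservation_released",
    }
)
METADATA_CHANNELS = ("metadata", "litellm_metadata")


def stash_value_in_metadata_channels(data: Dict[str, Any], key: str, value: Any) -> None:
    """Write a stash value into existing metadata dicts, never the top level."""
    written = False
    for channel in METADATA_CHANNELS:
        if isinstance(data.get(channel), dict):
            data[channel][key] = value
            written = True
    if not written:
        # Assumption: fall back to creating 'metadata' when no channel exists.
        data["metadata"] = {key: value}


def strip_top_level_stash_keys(data: Dict[str, Any]) -> None:
    """Defense in depth: remove stash keys that would leak into the provider body."""
    for key in LITELLM_STASH_KEYS:
        data.pop(key, None)
```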

Tests:
  - New regression test asserts no _litellm_* stash key is present at
    the top level of data after async_pre_call_hook, and that the
    metadata channel still carries the reservation + descriptors so
    success / failure reconciliation works.
  - Existing test_tpm_concurrent.py tests that asserted top-level
    presence are updated to read from data['metadata'] — the location
    is an implementation detail; the spec is that post-call callbacks
    can resolve the stash.

Verified end-to-end against OpenAI gpt-4o-mini and Anthropic
claude-haiku-4-5 via /v1/chat/completions on a low-rpm key:

  - With limits not exceeded: HTTP 200, valid completion response,
    no leaked fields in body.
  - With RPM exceeded: HTTP 429 from v3 enforcement
    ('Rate limit exceeded ... Limit type: requests').
  - With TPM exceeded: HTTP 429 from v3 enforcement
    ('Rate limit exceeded ... Limit type: tokens').

Full v3 hook test suite passes (171 tests).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* chore(rate-limit): use RATE_LIMIT_DESCRIPTORS_KEY constant in test, trim noisy comments

Address greptile P2: test fixture now uses the imported constant.
Drop comments that re-explain what well-named identifiers already convey.

* fix(rate-limit): reject caller-supplied stash values to prevent TPM-refund abuse

Strip _LITELLM_STASH_KEYS from data top-level and both metadata channels at
the start of async_pre_call_hook. Without this, an authenticated caller can
inject _litellm_rate_limit_descriptors plus _litellm_tpm_reserved_tokens in
body metadata, trigger a proxy-side rejection, and cause
async_post_call_failure_hook to refund TPM counters against attacker-named
scopes (e.g. another tenant's api_key).

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix: allow for allowlisted redirect URIs (#27761)

* fix: allow for allowlisted redirect URIs

* github comment addressing

* Update litellm/proxy/_experimental/mcp_server/oauth_utils.py

Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com>

* harden oauth wildcard further

* test: cover wildcard entry with dot-leading suffix rejection

---------

Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com>

* Emit native web_search_tool_result blocks for Anthropic clients (Claude Desktop / Cowork citations) (#27886)

* feat(custom_logger): add async_post_agentic_loop_response_hook

Lets a CustomLogger shape the response returned by the agentic-loop
follow-up call without bypassing the loop's safety / observability
machinery (depth tracking, fingerprinting, etc.). Default returns the
response unchanged.

Used by websearch_interception to inject Anthropic-native
web_search_tool_result blocks when the originating client requested a
native web_search_* tool.

* feat(llm_http_handler): call post-agentic-loop hook on the originating callback

In _execute_anthropic_agentic_plan, after anthropic_messages.acreate
returns, call the originating callback's
async_post_agentic_loop_response_hook so it can mutate the final
response (e.g. inject native tool_result blocks). Pass the callback
through from _call_agentic_completion_hooks.

Exceptions in the post-hook are caught and logged so a buggy callback
can't kill the request.
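The guarded call can be sketched roughly as follows. The hook name is from this message; the `response=` keyword, the helper name, and the logger are assumptions, and the real call site passes more context.

```python
import asyncio
import logging
from typing import Any

logger = logging.getLogger("agentic_loop_sketch")  # hypothetical logger name


async def run_post_agentic_loop_hook(callback: Any, response: Any) -> Any:
    """Call the callback's post-hook if present; fall back to the original response on error."""
    hook = getattr(callback, "async_post_agentic_loop_response_hook", None)
    if hook is None:
        return response
    try:
        # Assumed signature: the hook receives the response and returns a response.
        return await hook(response=response)
    except Exception:
        # A buggy callback must not fail the request, so log and keep the original.
        logger.exception("post-agentic-loop hook failed; returning original response")
        return response
```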

* feat(websearch_interception): add is_anthropic_native_web_search_tool

Identifies tools the Anthropic-native clients (Claude Desktop, the
Anthropic SDK, the Anthropic Console) use to request native search:
type starts with "web_search_" (e.g. web_search_20250305). Rejects the
LiteLLM standard tool, the OpenAI-function variant, the bare
"WebSearch" legacy name, and the bare "web_search" Claude Code shape.

This lets us decide per-request whether the client expects
web_search_tool_result content blocks in the response, without
renaming any existing constants or touching native-provider skip
logic.
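A sketch of the detection rule as described here, assuming tools arrive as plain dicts. The function name below is hypothetical and the real helper may check more fields.

```python
from typing import Any, Mapping


def looks_like_native_web_search_tool(tool: Mapping[str, Any]) -> bool:
    """True only when the tool's type starts with 'web_search_' (e.g. web_search_20250305)."""
    tool_type = tool.get("type")
    return isinstance(tool_type, str) and tool_type.startswith("web_search_")
```

The trailing underscore in the prefix is what excludes a bare `web_search` type, and tools that have no `type` at all are rejected as well.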

* feat(websearch_interception): add build_web_search_tool_result_block

Produces the Anthropic-native web_search_tool_result content block
from a structured SearchResponse. Anthropic-native clients use this
block to populate citations / source links — the existing text-blob
flatten path only feeds readable evidence to the model and discards
the structure, so this builder gives us the missing piece.

Shape matches https://docs.anthropic.com/en/api/web-search-tool —
web_search_result items carry url, title, page_age, encrypted_content
(empty string when the search provider doesn't supply one).
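For illustration, a builder of this shape might look like the sketch below. It assumes search results are plain dicts with `url`, `title`, and optional `page_age` / `encrypted_content` keys; the real builder takes a structured SearchResponse, whose fields are not shown in this message.

```python
from typing import Any, Dict, List, Optional


def build_tool_result_block_sketch(
    tool_use_id: str, results: Optional[List[Dict[str, Any]]]
) -> Dict[str, Any]:
    """Build a web_search_tool_result content block from simple result dicts."""
    items = []
    for result in results or []:
        items.append(
            {
                "type": "web_search_result",
                "url": result.get("url", ""),
                "title": result.get("title", ""),
                "page_age": result.get("page_age"),
                # Empty string when the search provider supplies no encrypted content.
                "encrypted_content": result.get("encrypted_content") or "",
            }
        )
    return {
        "type": "web_search_tool_result",
        "tool_use_id": tool_use_id,
        "content": items,
    }
```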

* feat(websearch_interception): emit native web_search_tool_result blocks

When the originating client request carried a native Anthropic
web_search_* tool, the final response now also carries
web_search_tool_result content blocks alongside the model's text
answer — so Claude Desktop / Anthropic SDK clients can populate the
citations panel and replay conversation history with structured search
evidence.

Wiring:
- Pre-request hooks (both deployment + Anthropic path) set a flag on
  kwargs when they see a native web_search_* tool, so the signal
  survives the conversion-to-litellm_web_search step regardless of
  which hook fires first.
- _execute_search now returns (text, SearchResponse) so the structured
  results aren't lost when the text is flattened for the follow-up
  model call.
- _build_anthropic_request_patch returns the parallel list of
  SearchResponse objects.
- async_build_agentic_loop_plan pre-builds the web_search_tool_result
  blocks (one per tool_use_id) and stashes them on plan.metadata when
  the flag is set.
- async_post_agentic_loop_response_hook reads the metadata and
  prepends the blocks to response.content.
- _execute_agentic_loop mirrors the injection for the legacy path so
  both paths behave identically.

Clients that send the LiteLLM standard tool keep the existing
text-only behavior — no regression.

* test(websearch_interception): cover native web_search_tool_result emission

18 tests across:
- detector branches (native vs litellm-standard, OpenAI-function shape,
  Claude Desktop builtin WebSearch, bare web_search, missing type)
- block-builder shape (results, none, empty)
- pre-request hook flag-setting (native sets, standard does not)
- async_build_agentic_loop_plan attaches blocks to plan.metadata when
  the flag is present, leaves metadata untouched when absent
- post-hook injection into dict and object responses
- legacy _execute_agentic_loop mirrors the injection so both paths
  return the same shape

* test(websearch_short_circuit): keep _execute_search mocks in sync with new tuple return

* test(websearch_thinking_constraint): keep _execute_search mocks in sync with new tuple return

* feat(websearch_interception): emit native blocks from try_short_circuit_search

The agentic-loop post-hook only fires when the model returns a tool_use
block. Cowork / Claude Desktop on Bedrock actually make TWO requests
per user turn: the main /v1/messages with their builtin tool, and a
separate standalone /v1/messages whose only tool is
web_search_20250305. That second request hits try_short_circuit_search
— no agentic loop, no post-hook — and was returning text-only, leaving
the citations panel empty.

When the short-circuit input carries a native web_search_* tool, build
a synthetic server_tool_use + web_search_tool_result pair (using the
structured SearchResponse already returned by _execute_search) so the
client gets the native shape it expects. The legacy text block is
preserved so non-native short-circuit callers (Claude Code,
github_copilot, etc.) see the same payload as before.

Failure path still emits the native block pair (with empty results)
plus the text-error block, so the client gets a well-formed response
rather than a malformed half-shape.

* test(websearch_native_blocks): cover short-circuit native-block emission

Three new cases on top of the existing 18:
- native web_search_20250305 short-circuit → [server_tool_use,
  web_search_tool_result, text], ids paired, urls/titles carried.
- litellm_web_search short-circuit → text-only (no regression).
- native short-circuit on search failure → still emits the native
  block pair (empty results) plus the text-error block, so the client
  never sees a malformed half-shape.

* test(websearch_short_circuit): index assertions by block type, not by position

Native short-circuit responses now have [server_tool_use,
web_search_tool_result, text] when the input carries
web_search_20250305 — find the text block by type rather than relying
on content[0].
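A small helper of the kind these assertions rely on, shown as a sketch. The helper name is hypothetical and content blocks are assumed to be dicts.

```python
from typing import Any, Dict, List, Optional


def first_block_of_type(content: List[Dict[str, Any]], block_type: str) -> Optional[Dict[str, Any]]:
    """Return the first content block whose 'type' matches, or None if there is none."""
    for block in content:
        if block.get("type") == block_type:
            return block
    return None
```

Looking blocks up by type keeps the assertions valid whether the response is text-only or carries the native block pair ahead of the text.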

* fix(websearch_interception): gate legacy WebSearch name on schema absence

Clients like Cowork / Claude Desktop ship a client-side tool named
"WebSearch" with a full input_schema — they handle it themselves and
expect to make a separate native web_search_20250305 sub-request for
the actual search.

Today is_web_search_tool matches the bare name regardless of other
fields, which hijacks the client's tool server-side. The agentic loop
fires on the main request, the model never gets to emit the
client-side tool_use, and the separate native sub-request (where
citation data flows) is never made. Net: citations panel empty.

Real Anthropic client tools always carry input_schema (the API rejects
them otherwise), so a bare {name: "WebSearch"} with no schema is the
only thing that could be a legacy interception marker. Gate the match
on schema absence: legacy callers (if any) keep working, real
client-side WebSearch tools pass through untouched.
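The gate can be sketched as below, assuming tools are plain dicts. The function name is hypothetical, and the real is_web_search_tool also matches other tool shapes that are omitted here.

```python
from typing import Any, Mapping


def is_legacy_websearch_marker(tool: Mapping[str, Any]) -> bool:
    """Match the bare 'WebSearch' name only when the tool carries no input_schema."""
    return tool.get("name") == "WebSearch" and tool.get("input_schema") is None
```

A tool that ships an `input_schema` is treated as the client's own tool and left alone. A tool with only a name (and perhaps a description) is still treated as the legacy interception marker.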

* fix(websearch_interception): drop "WebSearch" from response-detection lists

Post-conversion the model always sees ``litellm_web_search``, so the
"WebSearch" entry in the response-side tool_use detection lists was
dead at best. If a model ever did return ``tool_use(name="WebSearch")``
it would now (incorrectly) hijack the client's own ``WebSearch`` tool
again — same Cowork problem we just fixed on the input side. Drop it.

* test(websearch_native_blocks): cover the WebSearch legacy-name schema gate

Three new cases:
- {name: "WebSearch"} (bare interception marker) → still matched
- {name: "WebSearch", input_schema: {...}} (Cowork client tool) →
  passes through untouched
- {name: "WebSearch", description: "..."} (no schema) → still matched
  on the assumption it's a legacy marker rather than a malformed real
  client tool.

---------

Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com>

* ci(codecov): restore litellm/ prefix on uploaded coverage paths

pytest-cov runs with --cov=litellm, which makes coverage.xml store paths
relative to the package root (e.g. `proxy/proxy_server.py` instead of
`litellm/proxy/proxy_server.py`). Codecov auto-resolves these only when
the basename is unique in the repo. Files like proxy_server.py, router.py,
utils.py, main.py, and constants.py — which have duplicates under
enterprise/ or other subpackages — get silently dropped during ingest.

The `fixes: ["::litellm/"]` rule prepends `litellm/` to every uploaded
path so they resolve unambiguously. Confirmed against multiple recent
coverage.xml artifacts that no uploader currently emits paths already
prefixed with `litellm/`, so the rule is safe to apply universally.

This restores Codecov visibility for the highest-fix-rate hotspots:
proxy_server.py, router.py, proxy/utils.py, litellm_logging.py,
constants.py, key_management_endpoints.py, utils.py, main.py,
user_api_key_auth.py, team_endpoints.py, and litellm_pre_call_utils.py.

* chore(ci): remove unused GitHub Actions workflows and orphan files

Audit of .github/workflows/ via gh run history shows the following have
either never run or have been dormant for 10+ weeks. CI coverage that
still matters is preserved on CircleCI (e.g. llm_translation_testing).

Removed workflows:
- test-litellm.yml — workflow_dispatch only, last run 2026-02-12 (cancelled);
  CCI local_testing_part1/2 covers the same tests
- llm-translation-testing.yml — last run 2025-07-10; replaced by CCI
  llm_translation_testing job (run_llm_translation_tests.py kept for the
  make test-llm-translation target)
- run_observatory_tests.yml — last run 2026-03-03 (cancelled)
- scan_duplicate_issues.yml — last run 2026-03-02 (failure)
- publish_to_pypi.yml — never run
- read_pyproject_version.yml — fires on every push to main but its echoed
  version output is not consumed by any downstream step

Removed orphan files (no callers in workflows, CCI, or Makefile):
- .github/workflows/README.md — documented only publish_to_pypi.yml
- .github/workflows/update_release.py + results_stats.csv
- .github/actions/helm-oci-chart-releaser/

* Revert "ci(codecov): restore litellm/ prefix on uploaded coverage paths"

This reverts commit e25a988a3feb4a31843a67274a3a64fea2fed805.

The `fixes: ["::litellm/"]` rule turned out to be applied *after* Codecov's
auto-resolution, not before. Files with unique basenames (which were
auto-resolving correctly to `litellm/<path>`) got an extra `litellm/`
prepended, producing `litellm/litellm/<path>` storage. Files with
ambiguous basenames (the actual target of the fix) continued to be
dropped because the auto-resolution still failed for them.

Net result on the verification run: 1375 files now stored under
unresolvable `litellm/litellm/...` paths, and the 11 originally-missing
hotspots are still missing. Reverting before piling on further changes.

* test(ui): preserve global Button/Tooltip mocks in per-file @tremor/react vi.mock

Per-file `vi.mock("@tremor/react", ...)` factories fully replace the
setup-level mock from `tests/setupTests.ts`, so the global Button/Tooltip
overrides are lost in any file that re-mocks `@tremor/react`. Without
them, the real Tremor `<Button>` leaks through and its internal
`useTooltip(300)` schedules a native 300ms `setTimeout` on pointer
events. When the test environment is torn down before the timer fires,
the trailing `setState` calls `getCurrentEventPriority`, which reads
`window.event` against a destroyed jsdom -> "window is not defined"
flake observed on CI.

Patches the 7 leaky test files to re-supply `Button` (bare `<button>`)
and `Tooltip` (Fragment) overrides matching `setupTests.ts`. Also drops
a dead `afterEach` workaround in `user_edit_view.test.tsx` (the
fake-timer dance it ran could not drain a real timer scheduled before
the swap) and corrects a misleading comment in `MakeMCPPublicForm.test.tsx`.

* ci: use --cov=./litellm so coverage paths resolve unambiguously in Codecov

pytest-cov treats --cov=<module-name> as a Python package and emits XML
paths relative to the package root, stripping the litellm/ prefix
(`proxy/proxy_server.py` instead of `litellm/proxy/proxy_server.py`).
Codecov's auto-prefix heuristic then drops every file whose basename is
ambiguous in the repo — `proxy_server.py` (3 copies under enterprise/),
`router.py` (2 copies), `utils.py` (20+), `main.py` (20+), `constants.py`
(2). The 11 highest-fix-rate hotspots have never appeared in Codecov.

Switching to --cov=./litellm treats the argument as a path, which makes
coverage.xml emit repo-relative paths (`litellm/proxy/proxy_server.py`).
Each path is unambiguous, so Codecov resolves all files correctly.

Verified locally: rerunning a single proxy_unit_tests test with
--cov=./litellm produced `filename="litellm/proxy/proxy_server.py"`,
`filename="litellm/router.py"`, and `filename="litellm/types/router.py"`
as distinct entries — exactly the disambiguation Codecov needs.

Touches every workflow that uploads coverage: the two reusable GHA
workflows (_test-unit-base.yml, _test-unit-services-base.yml),
test-mcp.yml, and all 14 invocations in .circleci/config.yml.

* fix(mcp): allow delegate PKCE bypass for internal MCP servers

Remove available_on_public_internet gating from delegate-auth-to-upstream
paths so oauth2 + delegate_auth_to_upstream interactive servers behave
the same when marked internal. Keeps M2M exclusion. Updates tests.

* chore(mcp): warn on internal + upstream PKCE delegate

Log verbose_logger.warning when loading oauth2 interactive servers with
available_on_public_internet=false and delegate_auth_to_upstream=true
(config + DB). Add a dashboard Alert for the same combination and a
CLAUDE note for operators. Add tests for the warning log and the M2M skip.

* fix(mcp): dedupe load_servers_from_config alias block

Removes accidental duplicate alias/mcp_aliases and get_server_prefix
logic (fixes PLR0915 and avoids resetting alias after mapping).

* fix(mcp): expose delegate_auth_to_upstream in MCP server list rows (#27936)

_build_mcp_server_table omitted delegate_auth_to_upstream, so GET /v1/mcp/server always returned the default false while the registry kept the DB value.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(proxy): fix vector store retrieve/list/update/delete without model (#27929)

* feat(proxy): fix vector store retrieve/list/update/delete routing without model

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(proxy): remove unchecked query-param injection in vector store management endpoints

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(proxy): use subset assertion for vector store route test to allow extra kwargs like shared_session

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(managed_batches): convert raw output_file_id to managed ID in CheckBatchCost poller (#27984)

* fix(managed_batches): convert raw output_file_id to managed ID in CheckBatchCost poller

CheckBatchCost bypasses async_post_call_success_hook, causing raw provider
output_file_ids to be persisted in LiteLLM_ManagedObjectTable. This fix converts
output_file_id and error_file_id to managed base64 IDs before the DB write.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(check_batch_cost): persist managed file before mutating response and propagate team_id

- Move setattr after store_unified_file_id so the response only receives the
  managed ID once the DB record is successfully written. Avoids serializing
  an orphaned managed ID into file_object when the store call fails.
- Populate team_id on the minimal UserAPIKeyAuth from job.team_id so the
  managed file record is created with the correct team ownership, allowing
  other team members to access the batch output file via /files/{id}/content.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(managed_batches): extend test to cover error_file_id conversion

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix managed file test

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(vertex-ai): fix zero cost/usage on completed Vertex AI batch jobs (#27912)

* fix(vertex-ai): fix zero cost/usage on completed Vertex AI batch jobs

Vertex batch jobs recorded 0 spend and 0 tokens after PR #25627 added
automatic transformation of GCS predictions.jsonl to OpenAI format.

Two bugs fixed:

1. batch_utils.py: the Vertex-specific cost/usage reader
   (calculate_vertex_ai_batch_cost_and_usage) was always invoked, but it
   reads raw usageMetadata fields that no longer exist in the
   OpenAI-shaped output. Now the reader is only used when
   disable_vertex_batch_output_transformation=True; otherwise the
   generic path handles the already-transformed OpenAI-shaped content.

2. cost_calculator.py: batch_cost_calculator skipped the global
   litellm.get_model_info() lookup when a model_info dict was passed
   in, even when that dict had no pricing fields (e.g. deployment
   metadata with only id/db_model). It now falls back to the global
   pricing table when the provided model_info has no pricing data.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Update litellm/cost_calculator.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix(cost-calculator): use not-any guard for pricing fallback in batch_cost_calculator

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(cost-calculator): treat explicit zero batch pricing as set in model_info

The fallback to litellm.get_model_info() used truthy checks on pricing
fields, so 0.0 was treated as missing and replaced by global rates.
Use `is not None` like elsewhere in cost calculation. Add regression test.
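The difference between the truthy check and the `is not None` check, as a sketch. The pricing field names and both helper names are assumptions for illustration; the real calculator inspects its own set of fields.

```python
from typing import Any, Callable, Mapping, Optional

# Assumed pricing field names, for illustration only.
_PRICING_FIELDS = ("input_cost_per_token", "output_cost_per_token")


def has_pricing(model_info: Optional[Mapping[str, Any]]) -> bool:
    """A field counts as set when it is not None, so an explicit 0.0 is respected."""
    if not model_info:
        return False
    return any(model_info.get(field) is not None for field in _PRICING_FIELDS)


def resolve_model_info(
    model_info: Optional[Mapping[str, Any]],
    global_lookup: Callable[[], Mapping[str, Any]],
) -> Mapping[str, Any]:
    """Use the provided model_info when it carries pricing, else the global table."""
    if has_pricing(model_info):
        return model_info  # type: ignore[return-value]
    return global_lookup()
```

With a truthy check, `{"input_cost_per_token": 0.0}` would look unpriced and be replaced by global rates. With `is not None`, it is kept as a deliberate zero price.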

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* feat: add weighted-routing failover (#27980)

* Feat: Add Weighted-Routing Failover

* test(router): cover weighted failover helper functions

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): align weighted failover deployment list type with mypy

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): address greptile review on weighted failover

- Narrow exception swallowing in `_maybe_run_weighted_failover` to
  `openai.APIError` so model failures defer to the regular fallback
  while programming bugs (AttributeError/KeyError/TypeError) surface.
- Note async-only limitation of `enable_weighted_failover` in the
  Router constructor docstring.
- Make the weighted distribution test less flaky (1000 iterations,
  looser bound) and make the non-simple-shuffle test deterministic by
  failing both deployments instead of relying on the latency strategy's
  first pick.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): ensure weighted failover metadata persists in kwargs

The previous `kwargs.setdefault(metadata_variable_name, {}) or {}` returned
a brand-new dict whenever the existing metadata was falsy (empty dict or
None), so writes to `_failover_excluded_ids` never made it back into
`kwargs`. Multi-hop weighted failover then re-selected previously failed
deployments and exhausted `max_fallbacks` prematurely.

Explicitly assign a fresh dict into kwargs when metadata is missing so
mutations are visible to subsequent failover hops.
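The pitfall is easy to reproduce in isolation. The sketch below uses hypothetical helper names and a fixed `metadata` key; the real code resolves the key through metadata_variable_name.

```python
from typing import Any, Dict


def get_metadata_buggy(kwargs: Dict[str, Any], name: str = "metadata") -> Dict[str, Any]:
    # An empty dict or None is falsy, so `or {}` hands back a dict that is
    # NOT stored in kwargs; writes to it are silently lost.
    return kwargs.setdefault(name, {}) or {}


def get_metadata_fixed(kwargs: Dict[str, Any], name: str = "metadata") -> Dict[str, Any]:
    metadata = kwargs.get(name)
    if not isinstance(metadata, dict):
        # Missing or None: store a fresh dict in kwargs so later writes are visible.
        metadata = {}
        kwargs[name] = metadata
    return metadata
```

An existing empty dict is returned as-is by the fixed version, so it stays the same object that later failover hops read from kwargs.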

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(router): regression for weighted failover metadata persistence

Asserts kwargs["metadata"]["_failover_excluded_ids"] is populated after
_maybe_run_weighted_failover, proving the metadata dict written by the
helper is the same object that lives in kwargs (no disconnected copy).
Pairs with the prior fix that replaced `setdefault(..., {}) or {}` with
an explicit get/assign so writes survive across hops.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): harden weighted failover error/state handling

- Catch RouterRateLimitError (ValueError) alongside openai.APIError in
  _maybe_run_weighted_failover so an exhausted intra-group retry falls
  through to the regular cross-group fallback path instead of bubbling
  out and bypassing configured fallbacks.
- Stop mutating the shared input_kwargs dict; build a local copy with
  the weighted-failover keys so the entry (with _excluded_deployment_ids)
  cannot leak into later fallback paths reading the same dict.
- _get_excluded_filtered_deployments now returns an empty list when the
  exclusion filter removes every healthy deployment, instead of falling
  back to the original list. The original-list behavior risked re-picking
  the just-failed deployment; callers already handle the empty case by
  raising their no-deployments error, which weighted failover now catches
  and converts into a normal cross-group fallback.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(router): fall through to rpm/tpm when total weight is zero

When the weight metric's total is zero (e.g. after weighted-failover
exclusion leaves only zero-weight backups), continue to the next metric
(rpm/tpm) instead of returning a uniform random pick immediately. This
lets rpm/tpm still drive routing when present, and only falls back to
the uniform random pick at the end if no metric provides a positive
total weight.
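A simplified sketch of the fall-through. It assumes each deployment is a flat dict with optional `weight`, `rpm`, and `tpm` values and that the list is non-empty; the function name and the flat layout are illustrative, not the router's actual data model.

```python
import random
from typing import Any, Dict, List, Optional, Sequence


def pick_deployment(
    deployments: List[Dict[str, Any]],
    metrics: Sequence[str] = ("weight", "rpm", "tpm"),
    rng: Optional[random.Random] = None,
) -> Dict[str, Any]:
    """Pick by the first metric with a positive total; uniform pick only if none has one."""
    rng = rng or random.Random()
    for metric in metrics:
        weights = [float(d.get(metric) or 0) for d in deployments]
        if sum(weights) > 0:
            return rng.choices(deployments, weights=weights, k=1)[0]
        # Total is zero for this metric: continue to the next one instead of
        # returning a uniform random pick immediately.
    return rng.choice(deployments)
```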

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(router): skip weighted failover when remaining deployments are all in cooldown

_maybe_run_weighted_failover was computing 'remaining' from all_deployments
(every deployment in the model group, including those in cooldown). This meant
that when all non-excluded deployments were in cooldown the method still invoked
run_async_fallback unnecessarily, which propagated into async_get_healthy_deployments,
found no eligible deployments, and raised RouterRateLimitError — only safely
caught thanks to the earlier exception-broadening fix.

The fix: before computing 'remaining', fetch the current cooldown set via
_async_get_cooldown_deployments and subtract it from all_ids. This allows
_maybe_run_weighted_failover to return None immediately (skipping the
run_async_fallback call entirely) when every non-failed deployment is in cooldown,
letting the caller fall through to the correct cross-group fallback path without
the wasteful extra round-trip.
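The revised 'remaining' computation amounts to the filtering sketched below. The helper name and the plain-string deployment ids are assumptions; the real method obtains the cooldown set from _async_get_cooldown_deployments.

```python
from typing import Iterable, List


def remaining_healthy_ids(
    all_ids: Iterable[str],
    excluded_ids: Iterable[str],
    cooldown_ids: Iterable[str],
) -> List[str]:
    """Deployment ids that are neither already-failed nor currently in cooldown."""
    blocked = set(excluded_ids) | set(cooldown_ids)
    return [deployment_id for deployment_id in all_ids if deployment_id not in blocked]
```

An empty result is the signal to skip the intra-group retry and let the caller fall through to the cross-group fallback.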

Tests added:
- unit: _maybe_run_weighted_failover returns None without calling run_async_fallback
  when all remaining deployments are in cooldown
- unit: _maybe_run_weighted_failover still calls run_async_fallback when at least
  one healthy (non-cooldown) deployment is available
- integration: end-to-end fallthrough to cross-group fallback when remaining
  deployments are in cooldown

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix(bedrock-mantle): use /anthropic/v1/messages path for Mantle endpo… (#27976)

* fix(bedrock-mantle): use /anthropic/v1/messages path for Mantle endpoint (#27943)

* docs: add one-line docstring to _disable_debugging (#27894)

Squash-merged by litellm-agent from oss-agent-shin's PR.

* Add jp. Bedrock cross-region inference profile for claude-sonnet-4-6 (#27831)

Squash-merged by litellm-agent from Cyberfilo's PR.

* Sanitize empty text content blocks on /v1/messages (#27832)

Squash-merged by litellm-agent from Cyberfilo's PR.

* fix(bedrock-mantle): use /anthropic/v1/messages path for Mantle endpoint

The bedrock-mantle gateway (Claude Mythos Preview) serves the Anthropic
Messages API at /anthropic/v1/messages; /v1/messages returns 404 Not
Found. Both AmazonMantleConfig (chat/completions caller route) and
AmazonMantleMessagesConfig (anthropic-messages caller route) hardcoded
the wrong path, so every Mantle request 404'd before reaching the model.

Per the Anthropic docs: "[Claude in Amazon Bedrock] uses the Messages
API at /anthropic/v1/messages with SSE streaming."
https://platform.claude.com/docs/en/api/claude-on-amazon-bedrock

Confirmed independently against the live endpoint:
  /v1/chat/completions      -> 200 OK
  /v1/messages              -> 404 Not Found  (what litellm used)
  /anthropic/v1/messages    -> 200 OK         (Claude only)

Adds a regression test asserting both Mantle configs build the
/anthropic/v1/messages path, and updates the existing assertions that
encoded the wrong path.

---------

Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>

* fix: sanitize empty text blocks in sync anthropic_messages_handler path

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: João Costa <13508071+jpv-costa@users.noreply.github.com>
Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(utils): import get_secret at runtime (#28014)

* fix(proxy): make /config/update env-var encryption idempotent

A single decrypt-then-encrypt chokepoint (_encrypt_env_variables_for_db)
now backs both update_config and save_config. Re-submitting a value the
Admin UI read back from /get/config/callbacks as ciphertext no longer
stacks a second encryption layer, which previously decrypted to garbage
and silently broke the callback. The chokepoint decrypts with the pure
_decrypt_db_variables (no os.environ mutation on the write path) and
encrypts exactly once; update_config merges only the sent keys so
untouched env vars keep their stored ciphertext byte-for-byte.
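The decrypt-then-encrypt idea can be sketched with a toy reversible encoding. Everything below is a stand-in: the `enc:` prefix, the base64 "cipher", and the function names are assumptions for illustration, and base64 is not encryption.

```python
import base64
from typing import Dict, Optional

_PREFIX = "enc:"  # toy marker so the sketch can recognise its own output


def toy_encrypt(plaintext: str) -> str:
    return _PREFIX + base64.urlsafe_b64encode(plaintext.encode()).decode()


def toy_decrypt(value: str) -> Optional[str]:
    """Return the plaintext, or None when the value is not toy ciphertext."""
    if not value.startswith(_PREFIX):
        return None
    try:
        return base64.urlsafe_b64decode(value[len(_PREFIX):].encode()).decode()
    except (ValueError, UnicodeDecodeError):
        return None


def encrypt_env_for_db(env: Dict[str, str]) -> Dict[str, str]:
    """Decrypt first, then encrypt exactly once, so re-submitted ciphertext is not double-wrapped."""
    out = {}
    for key, value in env.items():
        plaintext = toy_decrypt(value)
        out[key] = toy_encrypt(plaintext if plaintext is not None else value)
    return out
```

Feeding the output back in returns the same stored values, which is the idempotence property the regression test checks against the real handler.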

* test(proxy): add endpoint-level regression for /config/update double-encryption

Adds test_update_config_env_var_round_trip_not_double_encrypted, which
drives the real /config/update handler: first write plaintext, then
re-POST the stored ciphertext (the Admin UI round-trip) and assert the
value is not stacked with a second encryption layer and untouched keys
stay byte-identical. Verified to fail against the pre-fix handler and
pass after. Also tightens the unit test to exactly three ciphertext
re-feeds.

* chore(ci): modernize model references in tests and configs (#27856)

* test: modernize models used in CircleCI e2e test suites

Replaces obsolete models (gpt-4o, gpt-4o-mini, gpt-3.5-turbo,
claude-3-5-sonnet-20240620, claude-sonnet-4-20250514) with current
equivalents across the e2e_openai_endpoints and
proxy_e2e_anthropic_messages_tests CircleCI jobs.

- gpt-4o -> gpt-5.5 (responses API e2e tests)
- gpt-4o-mini -> gpt-5-mini (websocket responses, oai_misc_config)
- gpt-4o-mini-2024-07-18 -> gpt-4.1-mini-2025-04-14 (fine-tuning,
  still actively fine-tunable)
- gpt-4 / gpt-3.5-turbo target_model_names example -> gpt-5.5 /
  gpt-5-mini
- bedrock claude-3-5-sonnet-20240620 batch entry -> haiku-4-5-20251001
  (also aligning oai_misc_config model_name with what
  test_bedrock_batches_api.py actually requests)
- bedrock claude-sonnet-4-20250514 (deprecated, retires 2026-06-15)
  -> claude-sonnet-4-5-20250929

* test: point bedrock-claude-sonnet-4 alias at Sonnet 4.6, not 4.5

Greptile/Cursor flagged that after the previous commit, the
bedrock-claude-sonnet-4 alias collided with bedrock-claude-sonnet-4.5
(both pointed to claude-sonnet-4-5-20250929). Rename to
bedrock-claude-sonnet-4.6 and point it at the Sonnet 4.6 Bedrock ID
(us.anthropic.claude-sonnet-4-6, already in the litellm model
registry) so the alias name matches the underlying model version.

* test: modernize models across remaining CI-mounted configs & tests

Expands the modernization sweep to all CircleCI-mounted proxy configs
and to test directories where the model literal is a fixture/route key
(not the test's subject).

Config changes:
- proxy_server_config.yaml: bump gpt-3.5-turbo / gpt-3.5-turbo-1106 /
  gpt-4o / gemini-1.5-flash / dall-e-3 underlying models; rename
  gpt-3.5-turbo-end-user-test alias to gpt-5-mini-end-user-test; bump
  text-embedding-ada-002 underlying to text-embedding-3-small. User-
  facing aliases (gpt-3.5-turbo, gpt-4, text-embedding-ada-002, etc.)
  preserved for backward compatibility with tests.
- simple_config.yaml, otel_test_config.yaml, spend_tracking_config.yaml:
  bump gpt-3.5-turbo underlying to gpt-5-mini.
- pass_through_config.yaml: claude-3-5-sonnet / claude-3-7-sonnet /
  claude-3-haiku entries replaced with claude-sonnet-4-5 / claude-
  haiku-4-5 / claude-opus-4-7.
- oai_misc_config.yaml: align alias name with the gpt-5-mini rename.

Test changes (proactive: claude-sonnet-4-20250514 / claude-opus-4-
20250514 retire 2026-06-15):
- tests/llm_translation/test_anthropic_completion.py: bump 3 references
  + paired Vertex AI ID to claude-sonnet-4-5.
- tests/llm_translation/test_optional_params.py: bump 2 references.
- tests/pass_through_unit_tests/test_anthropic_messages_passthrough.py
  and test_bedrock_anthropic_messages_test.py: bump router fixtures
  using the deprecated model IDs.
- tests/pass_through_unit_tests/base_anthropic_messages_tool_search_test.py:
  modernize docstring examples.
- tests/test_end_users.py: update references to renamed alias.

* test: modernize placeholder model literals in router_unit_tests

Mass replace_all on fixture/placeholder model literals across the
router_unit_tests/ suite (model name is a routing key / label, not the
test subject). Sub-agent sweep so far — additional commits will follow
for logging_callback_tests/, enterprise/, top-level tests/test_*.py,
and other CI-mounted dirs.

Mappings applied:
- gpt-3.5-turbo -> gpt-5-mini
- gpt-4 (bare) -> gpt-5.5
- gpt-4o (bare) -> gpt-5
- text-embedding-ada-002 -> text-embedding-3-small
- claude-3-sonnet-20240229 / claude-3-opus-20240229 /
  claude-3-haiku-20240307 / claude-3-5-sonnet-20240620 ->
  claude-sonnet-4-5-20250929 / claude-opus-4-7 /
  claude-haiku-4-5-20251001 as appropriate

Explicitly preserved:
- gpt-4o-mini-* variants (transcribe, tts, etc.) where they're current
- gpt-4-turbo / gpt-4-vision-preview / gpt-4-0613 (subject literals)
- JSONL batch body literals
- Mock LLM response model fields (must match upstream)
- Fake/mock identifiers

* test: modernize placeholder model literals across remaining CI suites

Sub-agent sweep across logging_callback_tests/, guardrails_tests/,
enterprise/, pass_through_unit_tests/, otel_tests/,
llm_responses_api_testing/, batches_tests/, spend_tracking_tests/,
litellm_utils_tests/, unified_google_tests/, and a few top-level
tests/test_*.py files where the model literal is a fixture or
placeholder (router model_list, mock standard logging payload, mock
callback data) rather than the test's subject.

Mappings applied (see scope notes below):
- gpt-3.5-turbo -> gpt-5-mini
- gpt-4 (bare) -> gpt-5.5
- gpt-4o (bare) -> gpt-5.5 (corrected from initial gpt-5 — bare gpt-5
  is not a valid OpenAI alias; only gpt-5.5 / gpt-5.4 / gpt-5.2-codex
  / gpt-5-mini exist)
- gpt-4o-mini (bare) -> gpt-5-mini
- text-embedding-ada-002 -> text-embedding-3-small
- claude-3-sonnet-20240229 -> claude-sonnet-4-5-20250929
- claude-3-opus-20240229 -> claude-opus-4-7
- claude-3-haiku-20240307 -> claude-haiku-4-5-20251001
- claude-3-5-sonnet-20240620/20241022 -> claude-sonnet-4-5-20250929
- claude-3-7-sonnet-20250219 -> claude-sonnet-4-6
- gemini-1.5-flash -> gemini-2.5-flash
- gemini-1.5-pro -> gemini-2.5-pro

Explicitly preserved (not modernized):
- llm_translation/ tests where model is the SUBJECT (provider-specific
  translation/transformation logic). Only the deprecated 20250514
  references were already bumped in a prior commit.
- Cost-calc / tokenizer subject tests in test_utils.py (skip-ranges
  documented by the sub-agent).
- Bedrock model IDs in test_health_check.py path-stripping tests.
- JSONL batch request bodies and mock LLM response bodies (must match
  upstream literal).
- Langfuse expected-request-body JSON fixtures (cost values are exact-
  match-asserted; changing the model would shift response_cost).
- gpt-3.5-turbo-instruct (text-completion endpoint; no modern OpenAI
  equivalent).
- Top-level tests calling the proxy through user-facing aliases
  (gpt-3.5-turbo, gpt-4, text-embedding-ada-002, dall-e-3) — aliases
  in proxy_server_config.yaml stay; only the underlying model was
  bumped.
- tests/test_gpt5_azure_temperature_support.py (the test's whole point
  is model-name handling).
- Fake / mock / openai/fake identifiers.

Notable side fixes:
- test_spend_accuracy_tests.py: UPSTREAM_MODEL now matches what
  spend_tracking_config.yaml's proxy actually routes to (gpt-5-mini),
  resolving a latent inconsistency.
- proxy_server_config.yaml: bare `gpt-5` alias renamed to `gpt-5.5`
  (bare gpt-5 is not a valid OpenAI alias).
- test_batches_logging_unit_tests.py: explicit_models list entries
  kept distinct (gpt-5-mini + gpt-5.5) after bulk rename.

* test: fix CI failures from model modernization sweep

CI surfaced 5 categories of regression from the bulk modernization:

1. Azure deployment names are customer-specific. Reverted:
   - tests/litellm_utils_tests/test_health_check.py:
     azure/text-embedding-3-small -> azure/text-embedding-ada-002 (the
     CI Azure account does not have a text-embedding-3-small
     deployment).
   - tests/logging_callback_tests/test_custom_callback_router.py:
     same revert for two router fixtures driving aembedding.

2. gpt-5 family does not accept temperature != 1. Tests that pass a
   custom temperature swapped from gpt-5-mini to gpt-4.1-mini (modern
   non-reasoning OpenAI mini that still accepts temperature/logprobs):
   - tests/logging_callback_tests/test_datadog.py
   - tests/logging_callback_tests/test_langsmith_unit_test.py
   - tests/logging_callback_tests/test_otel_logging.py

3. proxy_server_config.yaml's gpt-3.5-turbo-large alias was routing to
   gpt-5.5 (a reasoning model that rejects logprobs). The proxy test
   tests/test_openai_endpoints.py::test_chat_completion_streaming
   exercises logprobs/top_logprobs through that alias. Bumped the
   underlying model to gpt-4.1 (non-reasoning, still modern).

4. tests/logging_callback_tests/test_gcs_pub_sub.py asserts against a
   pinned JSON fixture (gcs_pub_sub_body/spend_logs_payload.json) with
   hardcoded model="gpt-4o" and a model-specific spend value. Reverted
   the litellm.acompletion calls in the test to model="gpt-4o" so the
   fixture's exact-match assertions still hold.

5. tests/pass_through_unit_tests/test_anthropic_messages_passthrough.py:
   anthropic.messages.create routing to openai/gpt-5-mini returned an
   empty content[0] with max_tokens=100 (reasoning-token consumption).
   Swapped to openai/gpt-4.1-mini.

* test: fix Assistants API model + 2 cursor[bot] review nits

1. pass_through_unit_tests/test_custom_logger_passthrough.py: gpt-5.5
   isn't accepted by the /v1/assistants endpoint
   ("unsupported_model"). Switch to gpt-4.1-mini (modern, non-reasoning,
   supported by the Assistants API).

2. example_config_yaml/pass_through_config.yaml: the previous sweep
   bumped the claude-3-7-sonnet alias to claude-opus-4-7, which is a
   tier change (Sonnet -> Opus). Map to claude-sonnet-4-6 to keep the
   Sonnet tier intact. (Cursor bugbot review.)

3. example_config_yaml/simple_config.yaml: model_name was left as
   gpt-3.5-turbo while the underlying was bumped to gpt-5-mini, which
   muddles the "simple" example. Make both sides gpt-5-mini so the
   most basic example is a straight 1:1 mapping again. (Cursor bugbot
   review.)

* fix: revert gpt-4/gpt-3.5-turbo alias underlying to non-reasoning models

tests/test_openai_endpoints.py::test_completion calls the proxy alias
"gpt-4" with temperature=0, and other tests call gpt-3.5-turbo with
custom temperature / logprobs / the legacy /v1/completions endpoint.
The earlier modernization mapped both aliases to gpt-5.5 / gpt-5-mini,
which are reasoning models that reject temperature != 1 and don't
expose /v1/completions. Map the aliases to gpt-4.1 / gpt-4.1-mini
(modern non-reasoning OpenAI models) instead — keeps user-facing
aliases preserved while picking a current underlying that still
supports the parameters/endpoints the tests exercise.

* test(proxy): isolate run_server CLI tests from prisma DB-setup path

test_keepalive_timeout_flag and test_timeout_worker_healthcheck_flag
were the only run_server tests in test_proxy_cli.py that neither
stripped DATABASE_URL/DIRECT_URL nor mocked the prisma DB path. When a
DATABASE_URL is present (CI/env leak), run_server --local enters the DB
block and hangs in the subprocess.run(["prisma"]) call at
proxy_cli.py:987, which has no timeout, and then in the
ProxyExtrasDBManager migrate-deploy retry loops, costing ~370s per test
on the CI runner. --dist=loadscope pins both to
one xdist worker, so the proxy-infra job appears stuck at 99% and hits
the 20-min timeout.

Apply the same isolation every other run_server test in this file
already uses: mock PrismaManager.setup_database +
should_update_prisma_schema and strip DATABASE_URL/DIRECT_URL. Full
module drops from 31.7s to 2.9s locally; both tests fall off the slow
list.
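
The isolation pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in test_proxy_cli.py: `PrismaManagerStandIn`, `run_server_stand_in`, and `isolated_from_db` are hypothetical names standing in for the real PrismaManager and run_server, and only the standard-library `unittest.mock` calls are real.

```python
import os
from contextlib import contextmanager
from unittest import mock


class PrismaManagerStandIn:
    """Hypothetical stand-in for the real PrismaManager (not litellm code)."""

    @staticmethod
    def setup_database(use_migrate: bool = False) -> None:
        # The real method shells out to prisma; here we just fail loudly.
        raise RuntimeError("would shell out to prisma")


def run_server_stand_in() -> str:
    """Toy model of the DB block: only entered when DATABASE_URL is set."""
    if os.environ.get("DATABASE_URL"):
        PrismaManagerStandIn.setup_database(use_migrate=True)
        return "db-path"
    return "no-db-path"


@contextmanager
def isolated_from_db(*names: str):
    """Strip the given env vars and stub DB setup; both are undone on exit."""
    with mock.patch.dict(os.environ), mock.patch.object(
        PrismaManagerStandIn, "setup_database"
    ) as fake_setup:
        for name in names:
            os.environ.pop(name, None)
        yield fake_setup
```

Wrapping a test body in `with isolated_from_db("DATABASE_URL", "DIRECT_URL"):` keeps a leaked environment variable from steering the test into the slow path, and the stub means that even if the DB block is entered, nothing shells out.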

* feat: add OTEL GenAI latest-experimental semantic convention support (#27418)

- Introduce `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental` opt-in that switches OTEL traces to conform with the OpenTelemetry GenAI semantic conventions specification
- Extract all semconv behavior into a new `OTELGenAISemconvMixin` class in `gen_ai_semconv.py`, mixed into `OpenTelemetry` to keep concerns separated
- In semconv mode, span name follows `{operation} {model}` pattern (e.g. `chat gpt-4`) and span kind is set to `CLIENT` instead of legacy `litellm_request`
- Replace `gen_ai.system` with `gen_ai.provider.name` and drop `llm.is_streaming` in semconv mode; add `gen_ai.request.{frequency_penalty,presence_penalty,top_k,seed,stop_sequences,stream,choice.count}` and `gen_ai.usage.cache_{creation,read}.input_tokens` attributes
- Replace per-message `gen_ai.content.prompt` / per-choice `gen_ai.content.completion` log events with a single consolidated `gen_ai.client.inference.operation.details` event; omit `gen_ai.input/output.messages` when content capture is disabled
- Suppress the non-standard `raw_gen_ai_request` child span entirely in semconv mode
- Support both programmatic (`OpenTelemetryConfig.semconv_stability_opt_in` field) and environment variable activation; the two sources are unioned so either or both can enable the opt-in
- Extract OTEL SDK `LogRecord` / `SeverityNumber` version-compatibility shim into a reusable `_otel_log_types()` static method to deduplicate the `< 1.39.0` / `>= 1.39.0` import branching
- Add 30+ unit tests covering opt-in gating, span naming, attribute emission/omission rules, stop sequence normalization, cache token attributes, and the consolidated event lifecycle
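
As a rough illustration of the opt-in gating and span naming listed above (not the actual `OTELGenAISemconvMixin` code), the union of the two activation sources might look like this. The helper names are hypothetical, the comma-separated parsing is an assumption, and `litellm_request` as the legacy span name is inferred from the description above.

```python
import os
from typing import Mapping, Optional, Set

OPT_IN_ENV = "OTEL_SEMCONV_STABILITY_OPT_IN"
GENAI_LATEST = "gen_ai_latest_experimental"


def _parse_opt_in(value: Optional[str]) -> Set[str]:
    """Split a comma-separated opt-in string into a set of tokens."""
    if not value:
        return set()
    return {part.strip() for part in value.split(",") if part.strip()}


def genai_semconv_enabled(
    config_value: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """True if either the config field or the env var opts in (union)."""
    env = os.environ if environ is None else environ
    opted_in = _parse_opt_in(config_value) | _parse_opt_in(env.get(OPT_IN_ENV))
    return GENAI_LATEST in opted_in


def span_name(operation: str, model: str, semconv: bool) -> str:
    """`{operation} {model}` in semconv mode, else the assumed legacy name."""
    return f"{operation} {model}" if semconv else "litellm_request"
```

Taking the union rather than letting one source override the other means a deployment can enable the opt-in from the environment without touching code, and a programmatic opt-in cannot be silently disabled by an unrelated environment value.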

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>

* chore: retrigger CI

* test(ci): add reasoning_effort grid v4 e2e regression suite

Encode the 231-cell QA sweep (21 provider x model combos x 11 effort
values) from #27039 / #27074 as an automated CircleCI-gated regression
suite. Each cell hits the real provider endpoint, captures the outgoing
wire body via a pre-call CustomLogger, and asserts:

- thinking.type, output_config.effort, thinking.budget_tokens, max_tokens
  in the captured request body (regression signal for silent drops/strips
  in any provider transformation)
- HTTP status (200 vs BadRequestError -> 400) returned by litellm
  (regression signal for clean-error vs leaked-500 mappings)

The matrix is encoded as a small rule set keyed by (model_mode, effort)
plus per-model xhigh/max capability overrides, then expanded across the
five chat-completion routes (Anthropic direct, Azure AI Foundry, Vertex
AI, Bedrock Converse, Bedrock Invoke /chat) and the Bedrock Invoke
/v1/messages route. Cells skip at runtime when the route's provider env
vars are absent, so PR builds without credentials no-op gracefully.
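
A heavily simplified sketch of that expansion, assuming made-up tables: the routes, environment variable requirements, model names, effort subset, and expected values below are illustrative placeholders, not the suite's real data (the real grid has 21 combos and 11 effort values).

```python
from itertools import product

# All names and values below are illustrative, not the suite's real tables.
EFFORTS = ("none", "low", "high", "max")
ROUTE_ENV = {
    "anthropic": ("ANTHROPIC_API_KEY",),
    "bedrock_converse": ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"),
}
MODELS = {
    # model -> (mode, extra effort caps the model supports)
    "model-a": ("adaptive", frozenset({"max"})),
    "model-b": ("adaptive", frozenset()),
}
GATED = frozenset({"xhigh", "max"})  # efforts that need a per-model cap


def expected(mode, effort, caps):
    """Expected status and wire-body fields for one (mode, effort) cell."""
    if effort in GATED and effort not in caps:
        return {"status": 400}
    if effort == "none":
        # "none" means thinking and output_config are omitted entirely.
        return {"status": 200, "thinking_type": None, "effort": None}
    if mode == "adaptive":
        return {"status": 200, "thinking_type": "adaptive", "effort": effort}
    raise KeyError((mode, effort))


def cells():
    """Every (route, model, effort) combination in the grid."""
    return list(product(ROUTE_ENV, MODELS, EFFORTS))


def missing_env(route, environ):
    """Env vars the route needs that are unset; non-empty means skip."""
    return [name for name in ROUTE_ENV[route] if not environ.get(name)]
```

In a pytest suite, each tuple from `cells()` would typically become one parametrized case that calls `pytest.skip` when `missing_env` returns anything, which is what lets credential-less builds pass without running the cells.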

Wired into CircleCI as the reasoning_effort_grid_v4_e2e job behind the
existing main / litellm_* branch filter.

* fix(reasoning_effort_grid_v4): cleanup unused fixture, parse converse body, guard budget tokens

- Remove unused vertex_credentials_path fixture (and now-unused os import)
  from conftest.py.
- Parse Bedrock Converse complete_input_dict (logged as a JSON string by
  converse_handler.py) before passing to _assert_cell, so dict accessors
  work uniformly across routes.
- Extend _BUDGET_TOKENS with xhigh and max entries so the budget-mode
  branch in expected() cannot KeyError if a future budget model gains
  the matching cap.
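
The body-parsing fix can be pictured with a small normalizer like the one below. It is a sketch only: `as_request_dict` is a hypothetical helper name, and the assumption is simply that some routes log the request body as a JSON string while others log a dict.

```python
import json
from typing import Any, Dict, Mapping, Union


def as_request_dict(body: Union[str, bytes, Mapping[str, Any]]) -> Dict[str, Any]:
    """Return the logged request body as a dict, decoding JSON if needed."""
    if isinstance(body, (str, bytes, bytearray)):
        body = json.loads(body)
    if not isinstance(body, Mapping):
        raise TypeError(f"expected a JSON object, got {type(body).__name__}")
    return dict(body)
```

Normalizing once at the capture boundary keeps the assertion helper free of per-route branches.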

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(reasoning_effort_grid_v4): grant sonnet-4-6 entries the max-effort cap

The runtime _validate_effort_for_model allows effort='max' for any
Claude 4.6 model (opus or sonnet), and model_prices_and_context_window
sets supports_max_reasoning_effort: true for claude-sonnet-4-6. The
grid spec previously gave sonnet-4-6 entries _CAPS_NONE, so expected()
returned status=400 for effort='max', which mismatched the runtime's
status=200 and caused 6 cells (one per route) to fail.

Rename _CAPS_OPUS_4_6 to _CAPS_4_6 (since the cap set is shared by
opus and sonnet 4.6) and assign it to all sonnet-4-6 entries.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* refactor(tests): move reasoning_effort grid suite under llm_translation, drop v4 naming

- Drop the "v4" suffix throughout: it referred to the QA sweep iteration,
  not this test suite. There's only one regression suite, so just call it
  reasoning_effort_grid.
- Move tests/test…