Skip to content

fix: route volcengine (Doubao) tiered-pricing models to the tiered cost handler#30357

Merged
Sameerlite merged 1 commit into
BerriAI:litellm_oss_170626_1from
vineethsaivs:fix-volcengine-tiered-pricing-cost
Jun 17, 2026
Merged

fix: route volcengine (Doubao) tiered-pricing models to the tiered cost handler#30357
Sameerlite merged 1 commit into
BerriAI:litellm_oss_170626_1from
vineethsaivs:fix-volcengine-tiered-pricing-cost

Conversation

@vineethsaivs

Copy link
Copy Markdown
Contributor

Relevant issues

Fixes #30346

Type

🐛 Bug Fix

Changes

Volcengine doubao models (for example volcengine/doubao-seed-2-0-pro-260215) define tiered_pricing in the model map but carry no flat input_cost_per_token/output_cost_per_token. The cost_per_token dispatcher had no volcengine branch, so these models fell through to generic_cost_per_token, which only reads flat per-token costs. The result was that every volcengine doubao request was tracked as $0

The dashscope cost calculator already implements graduated tiered pricing and reads tiered_pricing generically from the model map; the only provider-specific part was the hardcoded provider passed to get_model_info. This change makes that provider an argument (defaulting to dashscope, so existing behavior is unchanged) and routes volcengine to the same handler, mirroring how dashscope is dispatched

Added a regression test that asserts the graduated cost across two tiers for a doubao model; it fails on the current code (cost is $0) and passes with the fix

Screenshots / Proof of Fix

This path is network-free: cost is computed from the bundled model_prices_and_context_window_backup.json, so it does not require a live volcengine key. Reproduction on the local model map:

Before (current behavior):

volcengine/doubao-seed-2-0-pro-260215, prompt_tokens=10000, completion_tokens=2000
-> prompt_cost=$0.000000, completion_cost=$0.000000  (tracked as $0)

After this change:

volcengine/doubao-seed-2-0-pro-260215, prompt_tokens=10000, completion_tokens=2000
-> prompt_cost > 0, completion_cost > 0, matching the graduated tiered_pricing in the model map

The regression test test_volcengine_tiered_pricing_graduated_cost in tests/test_litellm/test_cost_calculator.py encodes the expected graduated cost across the first two tiers and verifies it end to end through cost_per_token(..., custom_llm_provider="volcengine")

@CLAassistant

CLAassistant commented Jun 13, 2026

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
6 out of 7 committers have signed the CLA.

✅ ryan-crabbe-berri
✅ Sameerlite
✅ yuneng-berri
✅ shivamrawat1
✅ mateo-berri
✅ vineethsaivs
❌ yassin-berriai
You have signed the CLA already but the status is still pending? Let us recheck it.

@codecov

codecov Bot commented Jun 13, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes volcengine (Doubao) models being billed at $0 by adding a volcengine branch to the cost_per_token dispatcher that routes to the existing tiered-pricing handler, and parameterising that handler with a custom_llm_provider argument so it resolves model info for any provider.

  • litellm/cost_calculator.py: A new elif custom_llm_provider == "volcengine" branch imports and calls the shared tiered-pricing handler with custom_llm_provider="volcengine", exactly mirroring the existing dashscope branch.
  • litellm/llms/dashscope/cost_calculator.py: The cost_per_token function gains an optional custom_llm_provider parameter (default "dashscope") passed straight through to get_model_info; all existing callers are unaffected.
  • tests/test_litellm/test_cost_calculator.py: A network-free regression test (test_volcengine_tiered_pricing_graduated_cost) reads the bundled model map, constructs a cross-tier token count for doubao-seed-2-0-pro-260215, and asserts the exact graduated cost; uses monkeypatch.setattr for proper state cleanup.

Confidence Score: 5/5

Safe to merge — the change adds a new provider branch and a backward-compatible parameter; no existing behaviour is altered.

The dispatcher change is additive: it inserts a new elif that was previously missing, so no existing provider path is touched. The shared tiered-pricing function change is fully backward-compatible (new parameter defaults to dashscope). The regression test exercises the full cost-calculation path using the bundled model map, requires no live credentials, and uses monkeypatch for clean teardown. There are no data-path mutations, no auth changes, and no schema modifications.

No files require special attention.

Important Files Changed

Filename Overview
litellm/cost_calculator.py Adds a volcengine branch to the cost_per_token dispatcher that routes to the shared tiered-pricing handler, mirroring the existing dashscope branch; the lazy import avoids circular-import risk and the branch placement is consistent with all other provider branches in the file.
litellm/llms/dashscope/cost_calculator.py Adds an optional custom_llm_provider parameter (default "dashscope") to cost_per_token so the same tiered-pricing logic can resolve model info for any provider; the change is fully backward-compatible and the docstring is updated to reflect the broader purpose.
tests/test_litellm/test_cost_calculator.py Adds test_volcengine_tiered_pricing_graduated_cost, a network-free regression test that reads tier data from the bundled model map, constructs a cross-tier token count, and verifies the graduated cost exactly; correctly uses monkeypatch.setattr (unlike many existing tests in the file that directly mutate litellm.model_cost).

Reviews (3): Last reviewed commit: "fix: route volcengine (Doubao) tiered-pr..." | Re-trigger Greptile

Comment thread tests/test_litellm/test_cost_calculator.py Outdated
Comment on lines +658 to +666
elif custom_llm_provider == "volcengine":
# Volcengine (Doubao) models share Dashscope's tiered-pricing structure
from litellm.llms.dashscope.cost_calculator import (
cost_per_token as tiered_cost_per_token,
)

return tiered_cost_per_token(
model=model, usage=usage_block, custom_llm_provider="volcengine"
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Volcengine coupled to dashscope's cost calculator file

The volcengine branch imports from litellm.llms.dashscope.cost_calculator, which ties an unrelated provider to dashscope's internal module. Any future dashscope-specific changes to that file (e.g., dashscope-flavoured token handling) could silently affect volcengine cost calculations. Consider either creating a thin litellm/llms/volcengine/cost_calculator.py that re-exports the shared logic, or extracting the reusable tiered-pricing function into a provider-neutral location (e.g., litellm/llms/utils/tiered_cost_calculator.py) that both dashscope and volcengine import from.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@vineethsaivs vineethsaivs force-pushed the fix-volcengine-tiered-pricing-cost branch from be8cd02 to c0c062c Compare June 15, 2026 01:18
@vineethsaivs

Copy link
Copy Markdown
Contributor Author

Good catches, thanks. I've switched the test to monkeypatch.setattr(litellm, "model_cost", ...) so it no longer leaks global state into later tests, matching the other tests in this file.

On the cross-provider coupling: I kept the change minimal by reusing the existing tiered-pricing calculator via a custom_llm_provider argument (defaulting to dashscope, so existing behavior is unchanged) rather than duplicating the graduated-tier logic. The function only reads tiered_pricing from the model map, so it's already provider-agnostic. Happy to relocate it to a provider-neutral module (or add a thin volcengine/cost_calculator.py wrapper) if you'd prefer that boundary; just let me know which you'd like.

@Sameerlite

Copy link
Copy Markdown
Collaborator

Thanks for the fix — the before/after cost output in the description is great proof! The Greptile review is a bit stale (commits landed after the last review). Triggering a fresh pass:

@greptileai

@vineethsaivs vineethsaivs force-pushed the fix-volcengine-tiered-pricing-cost branch from c0c062c to 57fbdbb Compare June 17, 2026 02:52
@vineethsaivs vineethsaivs changed the base branch from litellm_internal_staging to litellm_oss_branch June 17, 2026 02:53
@vineethsaivs vineethsaivs requested a review from a team June 17, 2026 02:53
@vineethsaivs

Copy link
Copy Markdown
Contributor Author

Retargeted this PR onto litellm_oss_branch per the contribution guard and rebased it, so the diff is just the three intended files (the volcengine dispatch branch in cost_calculator.py, the custom_llm_provider argument on the shared tiered-pricing handler, and the regression test). @greptileai

@Sameerlite

Copy link
Copy Markdown
Collaborator

Thanks for the contribution! A few things to get this ready for review:

  • Wrong base branch: This PR targets litellm_oss_branch but community PRs should target litellm_internal_staging. Could you rebase?

    git fetch origin
    git rebase --onto origin/litellm_internal_staging origin/litellm_oss_branch <your-branch>
    git push --force-with-lease
    

    Then update the base in GitHub's UI (Edit → Base: litellm_internal_staging).

  • Unresolved Greptile review threads — there are open threads from Greptile's review that haven't been resolved yet. Could you address those?

Once those are addressed, we'll take a closer look — thanks again!

…st handler

Volcengine (Doubao) models define `tiered_pricing` but no flat per-token cost, so cost_per_token fell through to generic_cost_per_token (which only reads flat costs) and tracked them at $0

Route custom_llm_provider == "volcengine" to the shared tiered-pricing handler in litellm/llms/dashscope/cost_calculator.py, which already computes graduated tier costs. Make that handler provider-agnostic by adding a custom_llm_provider argument (default "dashscope" preserves existing behavior) so get_model_info resolves the correct model map entry

Fixes BerriAI#30346
@vineethsaivs vineethsaivs force-pushed the fix-volcengine-tiered-pricing-cost branch from 57fbdbb to 2cd483b Compare June 17, 2026 04:08
@vineethsaivs vineethsaivs changed the base branch from litellm_oss_branch to litellm_internal_staging June 17, 2026 04:09
@vineethsaivs

Copy link
Copy Markdown
Contributor Author

Done. Rebased onto litellm_internal_staging and updated the base.

On the Greptile threads: the global-state one is resolved, the test now uses monkeypatch.setattr(litellm, "model_cost", ...) so it no longer leaks state into later tests. On the dashscope coupling: the shared cost_per_token only reads tiered_pricing from the model map and takes custom_llm_provider as a parameter, so it is already provider-agnostic; routing volcengine through it avoids duplicating the graduated-tier logic. I am happy to extract it into a provider-neutral module (e.g. litellm/llms/utils/tiered_cost_calculator.py) that both dashscope and volcengine import from if you prefer that boundary; I kept it in place for now since moving it would also relocate the private tier helpers and widen the diff. @greptileai

@greptile-apps

greptile-apps Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Thanks for addressing both threads and the clear explanation.

The global-state fix is correct — monkeypatch.setattr is the right approach and matches the pattern used elsewhere in the file.

On the dashscope coupling: your reasoning holds. The function is already provider-agnostic in behavior — get_model_info is parameterized, the tier math is generic, and the dashscope module path is purely a naming artifact. Extracting to litellm/llms/utils/tiered_cost_calculator.py would be the cleaner long-term home if a third provider ever needs to share it, but it's not necessary to block this fix. The diff is already minimal and the regression test gives a clear signal if the shared function ever diverges.

LGTM — the fix is correct, the test is solid, and the approach is reasonable given the scope of the change.

@Sameerlite Sameerlite changed the base branch from litellm_internal_staging to litellm_oss_170626_1 June 17, 2026 12:07
@Sameerlite Sameerlite merged commit 172e302 into BerriAI:litellm_oss_170626_1 Jun 17, 2026
3 checks passed
Sameerlite added a commit that referenced this pull request Jun 17, 2026
@vineethsaivs

Copy link
Copy Markdown
Contributor Author

I noticed this was merged into litellm_oss_170626_1 and then reverted there. If the revert was due to something on my end (a failing test or an integration conflict in the batch), I am happy to dig in and re-submit against the right branch; just let me know what you saw and I will turn it around quickly.

mateo-berri pushed a commit that referenced this pull request Jun 18, 2026
* fix(proxy): allow non-admin virtual keys to call GA Realtime WebRTC HTTP routes (#30089)

* fix(proxy): allow non-admin virtual keys to call GA Realtime WebRTC HTTP routes

Add the realtime WebRTC HTTP sub-routes (/realtime/client_secrets,
/realtime/calls and their /v1 + /openai/v1 variants) to
LiteLLMRoutes.openai_routes so is_llm_api_route() classifies them as
LLM API routes. Without this, non-admin virtual keys received
401 'Only proxy admin can be used to generate, delete, update info
for new keys/users/teams' when calling these endpoints.

Fixes #29923

* fix(proxy): validate session.model for realtime routes in model-access check

The GA Realtime WebRTC HTTP routes resolve the effective model from the
nested session.model (falling back to the top-level model), but the auth
layer's get_model_from_request() only extracted the top-level model. A
model-restricted virtual key could therefore place a disallowed model in
session.model, leave the top-level model unset, and skip can_key_call_model()
entirely - obtaining an ephemeral token for a model it is not allowed to use.

Extract session.model for the realtime client_secrets/calls routes so the
model-access check runs against the model the request will actually use.
Legitimate callers are unaffected; their permitted model still validates.

Relates to #29923

* fix(proxy): classify realtime transcription_sessions routes as LLM API routes

Add the GA Realtime WebRTC transcription_sessions HTTP routes to
openai_routes so is_llm_api_route() returns True for them, matching the
client_secrets and calls routes already fixed. These endpoints are
registered with user_api_key_auth in realtime_endpoints/endpoints.py, so
without this a non-admin virtual key calling
POST /v1/realtime/transcription_sessions would hit the admin-only 401
branch. Extends the regression test parametrization accordingly.

---------

Co-authored-by: habonlaci <4699494+habonlaci@users.noreply.github.com>

* feat(proxy): surface max_input_tokens/max_output_tokens on /v1/models (#30272)

* feat(proxy): surface max_input_tokens/max_output_tokens on /v1/models

* fix(proxy): degrade /v1/models gracefully when model-group lookup fails

---------

Co-authored-by: Sameer Kankute <sameer@berri.ai>

* fix: sort tiered token-cost thresholds numerically (#30375)

* fix: sort tiered token-cost thresholds numerically

_get_token_base_cost iterated input_cost_per_token_above_<N>_tokens keys with a
lexicographic sort, so for tiers whose thresholds have different digit lengths
(e.g. 90k vs 128k) a request crossing both was billed at the lower tier that
sorted first. Sort by the parsed numeric threshold instead, so the highest tier
the request actually crosses is applied.

* refactor: reuse _parse_above_token_threshold for inline threshold parse

---------

Co-authored-by: Eric (GabiDevFamily) <271972409+santino18727-debug@users.noreply.github.com>

* fix(openai): preserve cache_control for openai-compatible custom endpoints (#30387)

* fix(openai): preserve cache_control for openai-compatible custom endpoints

* fix(openai): use parsed hostname to detect real OpenAI for cache_control preservation

* fix(proxy): drain all daily-spend batches per flush cycle (#30281) (#30505)

* fix(types): prevent internal parallel_request_limiter fields from leaking to upstream providers (#30545)

* fix(types): add internal parallel_request_limiter fields to all_litellm_params to prevent forwarding to upstream providers

* test(types): add regression test for internal rate-limit fields in all_litellm_params

* fix(init): add bool type annotation to suppress_debug_info (#30531)

Module-level `suppress_debug_info = False` had no annotation, so strict
type checkers (e.g. ty) infer it as `Literal[False]`. Reassigning it to
`True` (as done in proxy_server.py and router.py) then fails with an
invalid-assignment error. Annotate it as `bool` to match every other
flag in this module.

* fix: coalesce null aggregates in update_metrics for no-spend keys (#29945)

* feat(team_endpoints): add query parameter `key_limit` to `/team/info` endpoint (#30006)

* feat(team_endpoints): Add query parameter key_limit to /team/info

* feat(team_endpoints): update schema.d.ts to include the new query parameter

* feat(team_endpoints): add tests for limitting key count in /team/info response

* feat(team_endpoints): Apply suggestions from greptile

* Set greater-than constraint on key-limit
* Fix type

* fix(router): release aiohttp connection when stream iteration ends abnormally (#30271)

* fix(router): release aiohttp connection when stream iteration ends abnormally

A streaming response that terminates with a mid-stream read timeout, a task
cancellation (client disconnect), or GeneratorExit never closed the underlying
aiohttp ClientResponse. aiohttp only auto-releases the connector slot at body
EOF, so each abnormally terminated stream permanently leaked one slot from the
shared TCPConnector pool. During a backend traffic spike the pool drains; once
exhausted every subsequent request to that host waits for a slot, times out
and surfaces as a 408, indefinitely, even after the backend recovers. Only a
proxy restart cleared the in-memory sessions, which matched the reported
symptom of a router stuck returning 408 for a healthy vLLM backend.

Close the response in a finally clause when iteration ends. On a fully read
response the connection was already released at EOF and close() is a no-op,
so keep-alive reuse for normal requests is unchanged.

Fixes #30192

* test(aiohttp): cover GeneratorExit path with a mock instead of a live socket

The previous slot-release test started a real aiohttp TCP server, which can
flake in offline CI and does not exercise this fix's code path directly.
Replace it with a dependency-injected mock that closes the stream generator
(GeneratorExit) and asserts the response is closed, covering the third
abnormal-exit path the finally block handles

* feat(proxy): serve Anthropic-native /v1/models for Claude Code gateway discovery (#30273)

* feat(proxy): serve Anthropic-native /v1/models for Claude Code gateway discovery

* refactor(proxy): move Anthropic model-list formatter into llms/anthropic/common_utils

* fix(proxy): make model_list request param optional for direct callers

* feat(dashscope): add Responses API support (#30286)

* feat(dashscope): add Responses API support

DashScope's OpenAI-compatible endpoint serves /responses, so register a
DashScopeResponsesAPIConfig that routes dashscope/* responses calls to
{api_base}/responses without rewriting the upstream model id, instead of
falling back to the chat-completions -> responses emulation pipeline.

Closes #29780

* feat(dashscope): mark responses API as not supporting native websocket

Matches the hosted_vllm/perplexity/openrouter responses configs, which all
override supports_native_websocket() to False since the OpenAI-compatible
endpoint has no native wss:// responses transport.

---------

Co-authored-by: Sameer Kankute <sameer@berri.ai>

* fix(spend-logs): preserve error_message on ProxyException failures (#30381)

* fix(spend-logs): preserve error_message on ProxyException failures

`StandardLoggingPayloadSetup.get_error_information` used
`str(original_exception)` to populate the human-readable error message
stored in `spend_logs.metadata.error_information.error_message`.

`ProxyException` (litellm/proxy/_types.py:3453) sets `self.message` in
its constructor but does NOT call `super().__init__(message)` and does
NOT define `__str__`. As a result, `str(ProxyException(...))` returns
the empty string, and every auth/budget/quota rejection was landing
in spend_logs with `error_message=""` despite a fully populated
traceback.

Operator impact: dashboard "LLM Failure" rows became untriageable —
the only way to tell a 401 from a 429 was to manually unpack the
traceback JSON via psql. Burst failure patterns (e.g. a UI session
polling with a stale token) produced 20-30 indistinguishable
`error_code=401` rows per second.

Fix: prefer the `.message` attribute (set by ProxyException and every
litellm.exceptions.* class) over `str(exc)`. The `str(exc)` fallback
is retained for non-litellm exception types, preserving prior behavior.

Test plan:
  - 2 new unit tests in tests/test_litellm/litellm_core_utils/
    test_litellm_logging.py:
    * test_get_error_information_prefers_message_attribute_over_str
    * test_get_error_information_falls_back_to_str_when_no_message_attr
  - Existing test_get_error_information_error_code_priority still passes
  - End-to-end verified: bad-key 401 now stores full
    "Authentication Error, Invalid proxy server token passed..."
    message in spend_logs.metadata.error_information.error_message

* fix(spend-logs): preserve explicit empty .message + drop dead reference

Greptile P2 on #30381. The truthiness check `if message_attr:`
silently skipped an explicit empty-string `.message` and fell
through to `str(original_exception)`. For ProxyException-shaped
objects both produce empty, so the bug was latent; for other
exception types it would inject a different string into
error_information.error_message and corrupt the signal.

Use `is not None` so an empty string survives verbatim.

Also drop the stale `See e2e/cases/11.` comment reference — that
path does not exist anywhere in the repo and confuses future
readers.

Regression test added: an exception with `.message=""` and a
non-empty `super().__init__()` arg must yield error_message == "".

* ci: retrigger workflows after base branch change to litellm_internal_staging

* fix(anthropic): strip LiteLLM-injected total_tokens from /v1/messages response (#30382)

* fix(anthropic): strip LiteLLM-injected total_tokens from /v1/messages response

The non-streaming /v1/messages response carries a LiteLLM-injected
usage.total_tokens = input_tokens + output_tokens that is not part of
the Anthropic API spec. This caused three problems:

1. Shape divergence with streaming on the same endpoint.
   message_delta.usage in the SSE path never carries total_tokens.
   Clients parsing both paths get two different schemas from one endpoint.

2. Shape divergence with upstream. Direct calls to
   https://api.anthropic.com/v1/messages return no total_tokens field,
   so clients using the official Anthropic SDK couldn't rely on it,
   and clients that did rely on the LiteLLM-injected one broke when
   bypassing the proxy.

3. Numerical misuse. total = input + output undercounts when
   cache_read_input_tokens and cache_creation_input_tokens are
   non-zero, because cache tokens are reported in their own fields.
   A 100k-token cached prompt with 1 non-cache input token + 200
   output tokens reports total_tokens = 201, off by ~99.8% from any
   reasonable definition of "total."

Fix: add _strip_total_tokens_from_anthropic_response in
litellm/proxy/anthropic_endpoints/endpoints.py and invoke it in the
success path of anthropic_response right before returning. Only mutates
dict-shaped responses; streaming (which already lacks the field) is
left untouched.

spend_logs / Prometheus continue to compute total_tokens internally
for billing — this fix only strips the field from the wire response.

Scope: only the Anthropic passthrough endpoint /v1/messages. The
OpenAI-shape /v1/chat/completions is unaffected.

* fix(anthropic): gate total_tokens strip behind flag + handle Pydantic .usage

Two P1 greptile threads on #30382:

P1 — **Backwards-incompatible removal without a feature flag**
  Stripping `usage.total_tokens` unconditionally breaks any client
  currently reading the LiteLLM-shaped non-streaming /v1/messages
  response. Per the codebase's policy (mirrors #30418), gate behind
  a new flag.

  - `litellm.strip_anthropic_total_tokens: bool = False` (default —
    backward-compat: clients keep seeing total_tokens).
  - Env override: `LITELLM_STRIP_ANTHROPIC_TOTAL_TOKENS=true`.
  - Docstring: planned to flip to True in a future major release;
    opt in early.

P1 — **Silent no-op if `result` is a Pydantic model**
  `base_process_llm_request` may return a Pydantic-style object
  whose `.usage` is a plain dict (the most common shape — e.g.
  objects wrapping raw upstream JSON). The original
  `isinstance(response, dict)` guard skipped strip on those, so
  `total_tokens` would still hit the wire. Helper now also reads
  `getattr(response, "usage", None)` and strips when that's a dict.

  Strongly-typed Pydantic `Usage` sub-models with required
  `total_tokens` fields are still skipped — those impose type
  constraints the helper doesn't try to subvert.

Tests:
- `test_strips_total_tokens_on_pydantic_model_with_dict_usage`
- `test_flag_defaults_off`
8/8 pass locally.

* fix(anthropic): drop env var for strip flag (docs CI)

Mirrors #30418's pattern (`expose_router_debug_in_errors: bool = True`,
no `os.getenv`). The `LITELLM_STRIP_ANTHROPIC_TOTAL_TOKENS` env var
introduced in the prior commit was flagged by
`tests/documentation_tests/test_env_keys.py` because the documentation
file `docs/my-website/docs/proxy/config_settings.md` lives in
`BerriAI/litellm-docs` (separate repo) and registering a new env key
requires a parallel docs PR — a friction we avoid here by exposing
the flag only as a Python attribute + `litellm_settings` config key,
both of which load through the existing proxy config plumbing without
needing the env-var registry to be updated.

No semantic change: default still False, behavior identical when set
via `litellm.strip_anthropic_total_tokens = True` or
`litellm_settings.strip_anthropic_total_tokens: true` in config.yaml.

Verified locally: env scan no longer surfaces the key; 8/8 tests pass.

* ci: retrigger workflows after base branch change to litellm_internal_staging

* fix(pricing): correct swapped input/output token costs for command-r7b-12-2024 (#30413)

* fix(pricing): correct swapped input/output token costs for command-r7b-12-2024

* test: resolve model prices JSON relative to test file for pip installs

* fix(exception-mapping): map Gemini upstream-error body code 429 to RateLimitError (#30417)

* fix(exception-mapping): map Gemini upstream-error body code 429 to RateLimitError

Some Gemini-compatible gateways (e.g. new-api) wrap a 429 rate-limit
signal from upstream inside an HTTP 500/503 envelope, with the real
code only surfaced in the JSON body:

    {"error":{"message":"...high demand...","type":"upstream_error",
              "param":"","code":429}}

Previously LiteLLM only looked at the HTTP status and mapped this to
InternalServerError, which Router treats as non-retryable for many
configs — so users got hard 500s instead of fallback/retry.

Now the Gemini/Vertex exception mapper parses error.code from the body
and routes code 429 to RateLimitError before falling through to the
HTTP-status branches. Other body codes fall through unchanged.

Tests cover:
- new-api gateway's `code:429` payload now maps to RateLimitError
- Genuine 500-body responses stay InternalServerError
- Non-JSON body strings fall through to status-code mapping unchanged

* fix(exception-mapping): scope body-code 429 promotion to 5xx envelopes

Addresses greptile P1/P2 + @Sameerlite's review on #30417. The new
elif branch was firing for any HTTP status, so a gateway response of
HTTP 400 with body {"error":{"code":429,...}} would be incorrectly
promoted to RateLimitError (retryable) instead of falling through
to BadRequestError. Same trap for 401 -> AuthenticationError.

Scoped the body-code 429 check to `500 <= status_code < 600` —
covers 500/502/503/504 (gateways wrapping upstream 429 in any 5xx
envelope) without inviting the 4xx misclassification.

Tests: parametrized table now covers 5xx (500/502/503), 4xx (400/401),
and the existing fall-through cases, asserting each maps to the
exception type that matches the HTTP status code. 50/50 pass locally.

* ci: retrigger workflows after base branch change to litellm_internal_staging

* feat(router): add expose_router_debug_in_errors flag (default True) to redact internal model_group/fallback names (#30418)

* feat(router)!: redact internal model_group/fallback names from exception messages

The Router was unconditionally appending internal config names onto
exception.message:
  - "Received Model Group=..."
  - "Available Model Group Fallbacks=..."
  - "No fallback model group found... Fallbacks={...}"
  - "context_window_fallbacks={...}"
  - Deployment-timeout messages including model_group
  - Fallback failure detail listing fallback chain

ProxyException forwards .message verbatim to clients, so gateways were
leaking their model_name / fallback wiring in every failed call.

Fix: gate all five mutation sites on a new
`litellm.expose_router_debug_in_errors` flag (default False). Set to
True to restore upstream debug behavior for local debugging.

Why: matches the redaction posture this codebase already has for
upstream model identifiers (cf. _litellm_returned_model_name) and
removes the last common error-path leak of internal model_group names.

Breaking change marker (!): if anything parses "Received Model Group="
out of client error messages, flip the flag on or migrate to the
x-litellm-* response headers instead.

Tests: 7 cases covering each of the 5 redaction sites + the flag-on
inverse path, plus a "default off" sanity check.

* test(router): cover sites 1 + 3 of expose_router_debug_in_errors gate

Addresses Greptile / codecov feedback on #30418: patch coverage was
55.6% with 4 lines uncovered in litellm/router.py. The existing tests
exercised sites 2 (ContextWindowExceededError), 4 (no-fallback-found),
and 5 (Received Model Group) — both default and flag-on. Sites 1 and 3
were declared in the PR description as covered by "site 5 also fires"
but the gate body lines for each (the `e.message +=` inside the
`if litellm.expose_router_debug_in_errors:` branch) only execute when
the flag is on AND the specific exception path is taken, which neither
existing test triggered.

Added 4 new tests (default + flag-on × 2 sites):

  - test_default_does_not_leak_deployment_timeout_debug
  - test_flag_on_leaks_deployment_timeout_debug
  - test_default_does_not_leak_content_policy_fallback_hint
  - test_flag_on_leaks_content_policy_fallback_hint

Trigger details:

  - Site 1 (litellm.Timeout in _acompletion) is reached via the
    Router-supported `mock_timeout=True` + `timeout=0.001` kwargs on
    `acompletion(...)`. Cannot embed a Timeout instance in model_list
    because Router.__init__ deep-copies it and Timeout.__reduce__ does
    not preserve the required positional args.
  - Site 3 (ContentPolicyViolationError without content_policy_fallbacks
    set, in async_function_with_fallbacks_common_utils) is reached by
    passing a `mock_response=litellm.ContentPolicyViolationError(...)`
    instance via the call-site kwarg — same deepcopy-avoidance reason.

11/11 tests pass locally. Patch coverage on litellm/router.py for this
PR's diff should now be 100%.

* chore(router): flip expose_router_debug_in_errors default to True

Addresses @Sameerlite's review on #30418 — maintain backward
compat on the wire. Redact becomes opt-in via setting the flag
to False; the historical behavior (leak internal model_group /
fallback wiring through exception messages) is preserved as the
default.

- litellm/__init__.py: default flipped to True, docstring rewritten
  with deprecation note pointing at a future flip to False (redact
  by default) in a major release.
- tests/test_litellm/test_router_exception_redaction.py: fixture
  resets to True (was False); the "off" tests now explicitly set
  False; the "default_leaks_*" tests rely on the fixture default.
  test_flag_defaults_off -> test_flag_defaults_on.
- No router.py change needed; the gate keys off the same flag,
  only the default changes.
- PR title no longer needs the breaking-change `!` marker — no
  client sees a behavior change at default settings.

11/11 pass locally.

* ci: retrigger workflows after base branch change to litellm_internal_staging

* feat(guardrails): integrate Repelloai Argus guardrail (#30465)

* feat(guardrails): add RepelloAI Argus guardrail integration (#1)

* feat(guardrails): add RepelloAI Argus guardrail integration

Add a new guardrail hook backed by RepelloAI Argus, with dashboard-managed
asset policies enforced via an asset_id and X-API-Key auth.

* fix(guardrails): harden RepelloAI Argus guardrail

- scan streaming responses on output (was bypassing the guardrail)
- log blocked verdicts as guardrail_intervened instead of success
- treat auth/config errors (401/403/404/422) as misconfiguration that
  always blocks, not a fail-open-able unreachable error
- default unreachable_fallback to fail_closed and read it directly;
  block on unknown/malformed verdicts so an API change can't silently
  disable enforcement
- type unreachable_fallback as a Literal, drop the duplicate config model,
  expose unreachable_fallback in the config schema, and stop leaking the
  raw provider response / exception strings to the client

* fix(guardrails): address RepelloAI Argus review feedback

- support ARGUS_API_KEY (with REPELLOAI_API_KEY fallback)
- make asset_id required in the config model
- normalize unreachable_fallback so only fail_open opens; block on 400 misconfig
- correct the shared unreachable_fallback field description

* docs(guardrails): add RepelloAI Argus docs page and dashboard listing

- add docs page covering config, env vars, modes, verdicts, failure semantics
- list RepelloAI Argus in the Guardrail Garden with provider/logo mappings
- add a regression test for the provider logo and display-name resolution

* fix(guardrails): keep RepelloAI asset_id optional in config model

A required asset_id leaked onto the shared LitellmParams (which inherits
RepelloAIGuardrailConfigModel), breaking validation for every other
guardrail. Keep it optional like sibling models; the guardrail __init__
still raises when asset_id is missing, which is the real enforcement.

* Add comment for last user turn scanning

* feat(guardrails): harden repelloai scanning

* feat(guardrails): expand repelloai scanning to include tool definitions

Add extraction of tool definitions and tool call arguments to the RepelloAI
guardrail scanning. Improves detection coverage by including function schemas
and parameters in the prompt sent to the guardrail service. Also captures
detailed error responses in logs and adds guardrail header to streaming responses.

* refactor(guardrails): fix and harden repelloai schema text extraction

- Fix duplicate text in _iter_schema_text: previously all dict values were
  re-queued onto the stack even after scalar/list keys were already extracted
  explicitly, causing names/descriptions to appear twice in the scanned prompt
- Extract schema key frozensets to module-level constants so they are not
  reconstructed on every call
- Change _iter_schema_text from @classmethod to @staticmethod (cls unused)
- Narrow _call_analyze stage param from str to Literal["prompt", "response"]
- Add HttpxResponse type annotation to _raise_for_config_error
- Add LLMResponseTypes annotation to async_post_call_success_hook response param

* fix(guardrails): resolve pyright type errors in repelloai guardrail

- Narrow async_handler.post return from Response|None to Response with
  explicit None guard before calling raise_for_status/json
- Fix list comprehension returning str|None by switching to explicit loop
  with isinstance guard so pyright tracks the narrowing
- Cast model_dump() result to Dict since hasattr does not narrow object
  type in pyright

* fix(guardrails/repello): include Responses API instructions field in prompt scan

The /v1/responses top-level `instructions` field was not included in
_extract_prompt_text, allowing a caller to bypass guardrail policy checks
by putting blocked content in `instructions` while keeping `input` benign.

* feat: add api_key to config model and read prompt from data dict

* fix(guardrails/repello): plug input_text and tool-call response bypass gaps

Responses API input content parts with type 'input_text' were silently
dropped by build_inspection_messages (which only handles type='text'),
allowing callers to send blocked content via that path without triggering
the pre-call scan. Fix: add _extract_input_text_parts to RepelloAIGuardrail
and call it when walking the Responses API input messages.

Post-call scanning skipped responses whose choices contained only tool_calls
or function_call (message.content=None), letting models put blocked output in
function arguments undetected. Fix: _extract_chat_completion_text now calls
_extract_tool_call_args_from_message on each choice message.

Also replace typing.Dict/List with builtin dict/list to clear TID251 strict
ruff violations introduced by this file.

* fix(guardrails/repello): scan Responses API function_call output arguments

Output items with type 'function_call' in a /v1/responses response were
skipped by _extract_responses_api_text; only 'message' items were walked.
A model could return blocked content in function_call.arguments undetected.
Now extract arguments from function_call output items before scanning.

* fix(anthropic): drop orphaned server_tool_use on multi-turn replay from generic OpenAI clients (#30486)

* fix(anthropic): drop orphaned server_tool_use on multi-turn replay from generic OpenAI clients

When an Anthropic server-side tool (web_search, id `srvtoolu_...`) is used, its
result is carried in `provider_specific_fields.web_search_results` — PRs #17746
/ #17798 restore it for callers that round-trip provider_specific_fields. A
generic OpenAI client that does NOT preserve provider_specific_fields (e.g. Open
WebUI talking to a Vertex/Anthropic model over /chat/completions) drops it on
replay and instead sends back an assistant `tool_call` + a `tool` message both
keyed to the `srvtoolu_` id. The transform then produced a bare `server_tool_use`
(with no following *_tool_result) plus a user `tool_result` for the same id —
both invalid, so the next turn 400s:

  messages.N.content.0: unexpected `tool_use_id` found in `tool_result` blocks:
  srvtoolu_... Each `tool_result` block must have a corresponding `tool_use`
  block in the previous message.

This is the commonly-reported vertex_ai symptom where Gemini works but Claude
400s on the 2nd turn of a web-search chat.

Fix (litellm/litellm_core_utils/prompt_templates/factory.py):
- convert_to_anthropic_tool_invoke: only emit a server_tool_use when its matching
  *_tool_result is available to pair with it; otherwise skip it (a bare
  server_tool_use is itself rejected).
- anthropic_messages_pt: drop a replayed `tool`/`function` message whose
  tool_call_id starts with `srvtoolu_` (a server-executed tool produces no client
  result; a user tool_result for it is invalid).

The existing reconstruction path (provider_specific_fields present, e.g. the
litellm SDK) is unchanged, as is regular client tool_use/tool_result.

Tests (tests/llm_translation/test_prompt_factory.py):
- update test_convert_to_anthropic_tool_invoke_server_tool ->
  test_convert_to_anthropic_tool_invoke_server_tool_without_result_is_dropped
- add test_anthropic_messages_pt_generic_client_drops_orphan_server_tool

Follow-up to #17746 / #17798; addresses the generic-client (no
provider_specific_fields) case of #17737.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(anthropic): cover the srvtoolu_ round-trip fix in the test_litellm unit suite

The regression tests added in tests/llm_translation/test_prompt_factory.py aren't
run by the coverage CI job (it runs tests/test_litellm), so the new factory.py
branches showed as uncovered (codecov patch coverage). Add equivalent focused
tests in the unit suite so both new branches are exercised there:
- convert_to_anthropic_tool_invoke drops a srvtoolu_ server_tool_use when no
  matching *_tool_result is available.
- anthropic_messages_pt drops the orphaned srvtoolu_ tool message a generic
  OpenAI client replays.

Refs #17737

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(anthropic): cover the server_tool_use + result valid-pair path in unit suite

Covers the remaining patch-coverage lines codecov flagged: convert_to_anthropic_tool_invoke
emitting server_tool_use followed by its web_search_tool_result when the matching
result is present (the litellm-SDK round-trip path). Refs #17737

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style(anthropic): flatten srvtoolu_ tool-message guard to a negated if

Addresses the Greptile style nit: replace the if-pass/else with a single negated
`if not (...)` guard around the tool_result append. Behavior unchanged. Refs #17737

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(proxy): require premium only when enabling premium metadata fields (#30285) (#30506)

Co-authored-by: Sameer Kankute <sameer@berri.ai>

* fix(perplexity): stop double-billing reasoning tokens in manual cost fallback (#30488)

* fix(perplexity): stop double-billing reasoning tokens in manual cost fallback

When perplexity_cost_per_token cannot use the API-provided usage.cost.total_cost short-circuit and falls back to manual calculation, it multiplies the full usage.completion_tokens by output_cost_per_token and then adds reasoning_tokens * output_cost_per_reasoning_token on top. Per the OpenAI/Perplexity usage convention codified for the central path in PR #18607, completion_tokens already INCLUDES reasoning_tokens, so the manual fallback double-bills reasoning at both the output and reasoning rate.

Concrete impact on perplexity/sonar-deep-research (input 2e-6, output 8e-6, reasoning 3e-6): for the exact usage shape exercised by the live response fixture in tests/llm_translation/test_perplexity_reasoning.py (prompt_tokens=9, completion_tokens=20, reasoning_tokens=15) the current code charges 0.000223 vs the convention-correct 0.000103, a 2.165x overcharge. The bug is reachable whenever Perplexity omits the cost object (streaming chunks, fixture-driven paths, older API versions).

Subtracts reasoning_tokens (clamped at zero) from completion_tokens before applying the output rate, mirroring how dashscope/cost_calculator.py and the central generic_cost_per_token already handle it. Preserves the existing fallback behaviour when output_cost_per_reasoning_token is unset (all completion_tokens stay at the output rate).

Existing tests in tests/test_litellm/llms/perplexity/test_perplexity_cost_calculator.py asserted the buggy math and are updated to the convention-correct math. Adds a focused regression test using the exact usage shape from the live response fixture so this class of bug cannot be silently reintroduced.

* style(perplexity): drop redundant type annotation on else branch to satisfy mypy

mypy [no-redef] flagged 'completion_cost' as declared in both if and else arms; keeping the annotation only on the first declaration matches existing patterns in this file.

* fix(perplexity): update integration test expected costs for non-double-billed math

Three tests in test_perplexity_integration.py asserted the old buggy expectation
that reasoning_tokens are billed in addition to the full completion_tokens
count. After the fix in cost_per_token, reasoning_tokens are billed at the
reasoning rate and the remaining (completion_tokens - reasoning_tokens) at the
standard output rate, matching OpenAI/Perplexity convention (PR #18607).

Updates: test_end_to_end_cost_calculation_with_transformation,
test_main_cost_calculator_integration, test_high_volume_cost_calculation.
The high-volume sanity threshold drops to 0.25 to reflect the corrected total.

* fix(ui): use dynamic proxy base URL in MCP usage examples (#30487)

Replace hardcoded http://localhost:4000 with getProxyBaseUrl() in the
MCP server usage example and copy-to-clipboard snippet so the generated
configuration works for non-local deployments.

Fixes #30466

* feat: add missing UK PII entity types to Presidio guardrail (#30537)

* feat: add missing UK PII entity types to Presidio guardrail

Add UK_PASSPORT, UK_POSTCODE, and UK_VEHICLE_REGISTRATION to PiiEntityType enum and PII_ENTITY_CATEGORIES_MAP. These entity types are supported by Microsoft Presidio but were missing from litellm's type definitions, preventing users from configuring UK-specific PII detection.

* test: remove fragile hardcoded entity count test

Remove test_uk_category_entity_count which hardcodes len() == 5. The test_uk_entities_match_presidio_recognizers test already verifies exact set equality, making the count test redundant and fragile to future Presidio additions.

* style: apply Black formatting to match CI requirements

* fix: route volcengine (Doubao) tiered-pricing models to the tiered cost handler (#30357)

Volcengine (Doubao) models define `tiered_pricing` but no flat per-token cost, so cost_per_token fell through to generic_cost_per_token (which only reads flat costs) and tracked them at $0

Route custom_llm_provider == "volcengine" to the shared tiered-pricing handler in litellm/llms/dashscope/cost_calculator.py, which already computes graduated tier costs. Make that handler provider-agnostic by adding a custom_llm_provider argument (default "dashscope" preserves existing behavior) so get_model_info resolves the correct model map entry

Fixes #30346

* feat(mcp): make MCP gateway name and description configurable via env vars (#30473)

* feat(mcp): make MCP gateway name and description configurable via env vars

* Rename function _restore_env to _apply_env

* docs(mcp): document import-time capture of env-backed identity constants

Address Greptile review feedback: clarify that LITELLM_MCP_SERVER_NAME and
LITELLM_MCP_SERVER_DESCRIPTION are read once at import and require a module
reload to observe env changes after import.

Generated with AI assistance

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Yevhen Luhovtsov <yevhen.luhovtsov@intapp.com>
Co-authored-by: Claude <noreply@anthropic.com>

* fix(mcp): preserve native tools in semantic filter hook (#26650)

* fix(mcp): preserve native tools in semantic filter hook

The SemanticToolFilterHook.async_pre_call_hook passed ALL tools (MCP +
native) to filter_tools(), which only knows MCP-registered tool names.
Native tools silently failed the name match in _get_tools_by_names()
and were dropped from the request.

Fix: partition tools into native and MCP-registered before filtering.
Run the semantic filter only on MCP tools, then merge native tools
back unconditionally.

Changes:
- Robust _is_mcp_tool() using shape-based detection for OpenAI-format
  dicts, safe regardless of future _extract_tool_info changes
- Single-pass partition loop (no double _is_mcp_tool calls)
- Preserve native tools in MCP expansion path (mixed requests)
- Track MCP expansion to prevent expanded tools bypassing filtering
- filter_stats reports MCP-only counts for accurate metrics
- Extracted _emit_filter_metadata() helper
- Skip spurious filter headers for all-native tool requests

Closes #26212

* remove stale docstring note referencing tools_expanded_from_mcp

* fix: handle Responses API name collision and preserve tool ordering

- Classify Responses API tools ({type: 'function', name: '...'}) as
  native to prevent name collisions with MCP canonical names
- Preserve original request tool ordering using id()-based merge
  instead of naive native+mcp concatenation
- Add 2 regression tests: name collision and ordering preservation

* style: apply black formatting

* fix(mcp): harden semantic filter — preserve all native tool formats, safe metadata access, graceful expansion failure, name-based merge

* lint: suppress PLR0915 on async_pre_call_hook (matches codebase convention)

* ci: retrigger checks after rebase onto litellm_internal_staging

* feat(fireworks): sync Fireworks AI model registry with current platform catalog (#30616)

Adds 12 new Fireworks serverless models and updates 3 existing entries in
model_prices_and_context_window.json and its bundled backup to match the
current Fireworks platform model list. New direct models: glm-5p2,
qwen3p7-plus, minimax-m3, minimax-m2p7, kimi-k2p7-code, kimi-k2p6,
deepseek-v4-pro, deepseek-v4-flash. New router endpoints: glm-5p1-fast,
kimi-k2p6-fast, kimi-k2p7-code-fast. Updated: glm-5p1, gpt-oss-120b, and
gpt-oss-20b now carry correct output token caps, cache-read pricing, and
explicit capability flags

max_tokens is set equal to max_output_tokens (not the full context window)
for models whose generation cap is below their context window. This avoids
the shared input+output budget path in get_modified_max_tokens, which would
otherwise let callers request output sizes the model cannot produce. The
same fix corrects the pre-existing glm-5p1, gpt-oss-120b, and gpt-oss-20b
entries that had max_tokens equal to the full context window

Short-form aliases (fireworks_ai/<model>) are added for every direct
accounts/fireworks/models/ entry so cost attribution works for callers
using bare model names. Router endpoints get short-form aliases too, and
transform_request now routes bare names ending in -fast to the
accounts/fireworks/routers/ path instead of defaulting every bare name to
models/. This keeps the kimi-k2p6-fast router from being misrouted to the
nonexistent models/kimi-k2p6-fast endpoint

kimi-k2p6-turbo is intentionally excluded; kimi-k2p6-fast is its
replacement. Context windows for deepseek-v4 and kimi models use the
power-of-two values (1048576 and 262144) published on the Fireworks model
pages, matching the convention already used by existing entries

Two regression tests in test_utils.py assert the exact per-token costs,
token limits, capability flags, and short-form-to-long-form equality for
all 15 models against both the main and backup cost maps. Two routing
tests in test_fireworks_ai_chat_transformation.py verify bare -fast names
route to routers/ and bare direct-model names route to models/

* fix(bedrock): handle role:"system" inside the messages array on /v1/messages (#29698) (#30443)

* feat(anthropic): hoist leading in-array system to top-level (helper)

* test(anthropic): cover _system_content_to_blocks edge cases; deepcopy cache_control

* test(anthropic): mid-conversation system normalization cases

* feat: add supports_mid_conversation_system flag to Claude Opus 4.8

Add supports_mid_conversation_system: true to all 9 claude-opus-4-8 cost-map
entries (Anthropic-native, Bedrock, Vertex, Azure AI) in both the root cost
map and the bundled package backup, since the runtime helper and tests read
the backup in local/offline mode.

Pin the mid-system passthrough regression test to the local cost map via the
existing local_model_cost_map fixture so it reads the branch-local flag rather
than the network-fetched main copy.

* fix(bedrock): normalize in-array system in /v1/messages handler (#29698)

Wire normalize_system_messages_for_anthropic into anthropic_messages_handler
so all Bedrock /v1/messages paths (Invoke / Mantle / ClaudePlatform /
Converse-bridge) hoist leading in-array system entries (and demote
mid-conversation ones on models lacking supports_mid_conversation_system) into
the top-level system field. The normalized messages/system are written back
into the local_vars snapshot the base_llm branch reads from, otherwise the
Invoke/Mantle fix would silently no-op.

Also fix the helper to resolve supports_mid_conversation_system through the
prefix-aware AnthropicModelInfo._supports_model_capability resolver. The raw
_supports_factory could not see the flag once get_llm_provider left the
invoke/ prefix on the model id, which would have wrongly demoted
mid-conversation system on a Bedrock invoke opus-4-8 path.

* fix(bedrock): resolve mid-conversation-system flag through mantle/invoke/converse route prefixes; drop unused param

* fix(types): widen system param to Union[str, List] for hoisted system blocks

* refactor(bedrock): drop dead local_vars messages writeback

* fix(bedrock/converse): translate in-array system in anthropic->openai adapter (#29698)

* fix(bedrock/converse): preserve cache_control on in-array system; test drop-empty

* fix(bedrock/converse): rename colliding local to satisfy mypy; test handler system-merge branches

* fix(types): register supports_mid_conversation_system in model-info schema

The cost-map JSON-schema validation test (test_aaamodel_prices_and_context_window_json_is_valid)
rejects unknown properties, so adding supports_mid_conversation_system to the opus-4-8
cost-map entries failed CI with 'Additional properties are not allowed'. Register the flag
in the INTENDED_SCHEMA allow-list and in the ProviderSpecificModelInfo TypedDict so it is a
typed, first-class capability flag alongside its peers (supports_output_config, etc.).

---------

Co-authored-by: Sameer Kankute <sameer@berri.ai>

* fix(bedrock/agentcore): optionally forward multimodal content blocks in InvokeAgentRuntime payload (#28885)

* fix(bedrock/agentcore): optionally forward multimodal content blocks in InvokeAgentRuntime payload

By default the agentcore provider flattens the last message to a text-only
{"prompt": "..."} payload via convert_content_list_to_str, silently dropping
OpenAI multimodal blocks (image_url, file, input_audio, ...).

This adds an opt-in `forward_multimodal_content` litellm param. When truthy and
the last message's content is a list containing a non-text block, the original
OpenAI content list is forwarded verbatim under a new "content" field so an
attachment-aware AgentCore agent can read it. Default off keeps the payload
byte-identical to the legacy {"prompt": "..."} shape — existing agents are
unaffected.

The flag is read from optional_params (where other AgentCore params land) with a
litellm_params fallback, and accepts a bool or a config/env string ('true', '1', ...).

AgentCore Runtime is schemaless on the agent side — the agent's @app.entrypoint
parses arbitrary JSON up to 100 MB (per
https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-invoke-agent.html),
so this is a purely upstream change; no AgentCore-side schema is asserted.

* fix(bedrock/agentcore): shallow-copy forwarded multimodal content list

Address review feedback (Sameerlite): payload["content"] = last_content
aliased the caller's mutable messages[-1]["content"] list. Harmless today
because the payload is JSON-serialized immediately, but a latent footgun if
a future caller mutates the returned payload before serialization. Forward
list(last_content) so the payload owns its own list. Block dicts stay shared
on purpose — a deep copy would clone potentially large base64 media on the
request hot path, and the flagged risk was the shared list, not the blocks.

Update the passthrough tests to assert equality + distinct identity, and add
a regression test that mutating the payload list can't leak back into the
original message content.

* Revert "fix(mcp): preserve native tools in semantic filter hook (#26650)"

This reverts commit 438c825.

* Revert "feat(guardrails): integrate Repelloai Argus guardrail (#30465)"

This reverts commit 54da785.

* Revert "feat(dashscope): add Responses API support (#30286)"

This reverts commit 6766256.

* Revert "fix(bedrock): handle role:"system" inside the messages array on /v1/messages (#29698) (#30443)"

This reverts commit b8a8083.

* Revert "fix(anthropic): drop orphaned server_tool_use on multi-turn replay from generic OpenAI clients (#30486)"

This reverts commit 6e9c0b0.

* Revert "fix: route volcengine (Doubao) tiered-pricing models to the tiered cost handler (#30357)"

This reverts commit 172e302.

* Revert "feat(proxy): serve Anthropic-native /v1/models for Claude Code gateway discovery (#30273)"

This reverts commit 4e31885.

* fix: pass key_limit=None in team_member_update and patch model_cost in pricing test

team_member_update called team_info without key_limit, so the fastapi.Query
default object (not None) was passed through to get_data, which failed when
serializing it. Pass key_limit=None explicitly to avoid this.

test_get_model_info_costs patched litellm.model_cost from the local backup so
the assertion holds before the PR is merged and the remote main URL is updated.

* fix(security): validate resolved model in /realtime/client_secrets for non-transcription sessions (#30710)

Omitting both model and session.model caused the endpoint to default to
gpt-4o-realtime-preview without running can_key_call_resolved_model, so
any key could access that model regardless of its allowed-model list.

The transcription path already called can_key_call_resolved_model; this
adds the same call for the realtime path before returning.

* fix(lint): fix F821 undefined model_info and F841 unused metadata in create_model_info_response

* fix: black formatting and stub get_model_group_info in third team translation test

* fix: reformat utils.py with black 26.3.1 to match CI

* fix: replace Optional[X] with X | None to satisfy UP045 ruff strict gate

---------

Co-authored-by: Habon Laszlo <habonlaci@users.noreply.github.com>
Co-authored-by: habonlaci <4699494+habonlaci@users.noreply.github.com>
Co-authored-by: Armaan Sandhu <74664101+Ar-maan05@users.noreply.github.com>
Co-authored-by: santino18727-debug <santino18727@gmail.com>
Co-authored-by: Eric (GabiDevFamily) <271972409+santino18727-debug@users.noreply.github.com>
Co-authored-by: Nitish Agarwal <1592163+nitishagar@users.noreply.github.com>
Co-authored-by: jho1-godaddy <171078705+jho1-godaddy@users.noreply.github.com>
Co-authored-by: 安妮的心动录 <74543653+anneheartrecord@users.noreply.github.com>
Co-authored-by: Harshith Gujjeti <153299927+Harshxth@users.noreply.github.com>
Co-authored-by: Tomoya Tabuchi <t@tomoyat1.com>
Co-authored-by: Vedant Agarwal <43557509+Vedant-Agarwal@users.noreply.github.com>
Co-authored-by: Prathamesh Jadhav <55660103+lollinng@users.noreply.github.com>
Co-authored-by: songkuan-zheng <252822057+songkuan-zheng@users.noreply.github.com>
Co-authored-by: Kropiunig <48442031+Kropiunig@users.noreply.github.com>
Co-authored-by: Lavish Bansal <lavish.bansal619@gmail.com>
Co-authored-by: Shane Emmons <27679+semmons99@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Anuj ojha <ojhaanuj224@gmail.com>
Co-authored-by: Nahrin <nahrin@nahrinoda.com>
Co-authored-by: Nbouyaa <67773915+FadelT@users.noreply.github.com>
Co-authored-by: Vineeth Sai <vineethsai4444@gmail.com>
Co-authored-by: Eugene Lugovtsov <34510252+EugeneLugovtsov@users.noreply.github.com>
Co-authored-by: Yevhen Luhovtsov <yevhen.luhovtsov@intapp.com>
Co-authored-by: Ayush Shekhar <106994833+ayushh0110@users.noreply.github.com>
Co-authored-by: Ahmad Shahzad <107808273+shzdehmd@users.noreply.github.com>
Co-authored-by: Kent <72616338+kingdoooo@users.noreply.github.com>
Co-authored-by: Jón Levy <levy@apro.is>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: volcengine doubao models with tiered_pricing are tracked as $0 cost

3 participants