perf(streaming): cut per-chunk overhead ~30% on Anthropic + Bedrock hot path #28720

Merged
yassin-berriai merged 1 commit into litellm_internal_staging from litellm_fix/LIT-3313-streaming-chunk-overhead
May 28, 2026

Conversation

@yassin-berriai yassin-berriai commented May 23, 2026

Summary

Resolves LIT-3313

Five targeted optimizations to CustomStreamWrapper that eliminate redundant work executed on every streaming chunk across Anthropic, Bedrock Invoke, and Bedrock Converse providers.

Changes

| # | Optimization | File | Impact |
|---|---|---|---|
| 1 | Fix sync __next__ critical bug | streaming_handler.py | Eliminates model_dump() + model_response_creator() call on every text chunk in the sync path |
| 2 | Cache model name + provider at init | streaming_handler.py | Removes per-chunk dict lookup + conditional string format in model_response_creator() |
| 3 | Pre-compute _base_hidden_params at init | streaming_handler.py | Replaces two per-chunk dict spreads with one pre-built base dict |
| 4 | Cache resolved streaming callbacks | streaming_handler.py | isinstance + hasattr filtering of litellm.callbacks resolved once per stream instead of per-call |
| 5 | Module-level _GCHUNK_FIELDS frozenset | streaming_handler.py | GChunk.__annotations__ read once at import time instead of per-chunk |

Root Cause: Sync Path Bug (Optimization 1)

The most impactful fix: ModelResponseStream declares usage: Optional[Usage] = None as a Pydantic field, so hasattr(response, "usage") always returns True. The sync __next__ had:

# BEFORE — runs on EVERY chunk:
if hasattr(response, "usage"):
    obj_dict = response.model_dump()   # expensive Pydantic serialisation
    del obj_dict["usage"]
    response = self.model_response_creator(chunk=obj_dict, ...)  # new object

The async __anext__ was already correct:

_has_usage = hasattr(processed_chunk, "usage") and getattr(processed_chunk, "usage", None) is not None

Fixed to match the async path:

# AFTER — only runs when usage is actually present (~1 chunk per stream):
if getattr(response, "usage", None) is not None:
    ...
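
The difference between the two guards can be reproduced without Pydantic. The sketch below uses a dataclass, `ChunkSketch`, as a hypothetical stand-in for ModelResponseStream; the only property it shares with the real class is that `usage` is a declared field defaulting to None.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ChunkSketch:
    # Hypothetical stand-in: the usage attribute always exists, defaulting to None.
    text: str
    usage: Optional[Dict[str, Any]] = None


def guard_before(chunk: ChunkSketch) -> bool:
    # Old sync guard: true for every chunk because the field is always declared.
    return hasattr(chunk, "usage")


def guard_after(chunk: ChunkSketch) -> bool:
    # New guard: true only when a usage payload is actually attached.
    return getattr(chunk, "usage", None) is not None


text_chunk = ChunkSketch(text="hello")
usage_chunk = ChunkSketch(text="", usage={"total_tokens": 12})
```

Here `guard_before(text_chunk)` is True while `guard_after(text_chunk)` is False, which is why the old guard took the expensive branch on every text chunk.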

Benchmark

Two reproducible microbenchmarks drive CustomStreamWrapper directly with synthetic in-memory chunks. A full proxy benchmark like benchmark_anthropic_messages_perf.py (used in #28289) would include FastAPI, the HTTP stack, and TCP latency — which dilutes the signal from per-chunk CPU work this PR targets. Both new scripts isolate the wrapper's hot path.

Methodology: 3 alternating A/B rounds (baseline → optimized → ...) on the same host, back-to-back, to amortize CPU-scheduling and thermal noise. Per round: warmup=2, repeats=6 (micro) / 4 (stream), report min per round, then take min across rounds. Baseline = streaming_handler.py at the merge-base (7c667b8797); optimized = HEAD.
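
The methodology above can be approximated with the standard-library `timeit` module. This is a minimal sketch, not the benchmark scripts from this PR; `min_us_per_call` and `best_of_rounds` are hypothetical helper names.

```python
import timeit
from typing import Callable, Tuple


def min_us_per_call(
    fn: Callable[[], object], iterations: int, warmup: int, repeats: int
) -> float:
    """One round: discard warmup runs, then keep the fastest timed repeat."""
    for _ in range(warmup):
        timeit.timeit(fn, number=iterations)
    timings = timeit.repeat(fn, number=iterations, repeat=repeats)
    return min(timings) / iterations * 1e6  # seconds per batch -> microseconds per call


def best_of_rounds(
    baseline: Callable[[], object],
    optimized: Callable[[], object],
    rounds: int = 3,
    iterations: int = 200_000,
    warmup: int = 2,
    repeats: int = 6,
) -> Tuple[float, float]:
    """Alternate baseline and optimized in each round, then take the min across rounds."""
    base, opt = [], []
    for _ in range(rounds):
        base.append(min_us_per_call(baseline, iterations, warmup, repeats))
        opt.append(min_us_per_call(optimized, iterations, warmup, repeats))
    return min(base), min(opt)
```

Taking the minimum rather than the mean treats scheduling and thermal noise as strictly additive, so the fastest run is the closest estimate of the true cost.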

model_response_creator() microbenchmark

Tight call loop over the function where 3 of the 5 optimizations live (cached model name, pre-computed _base_hidden_params, single dict spread). 200,000 iterations × 6 repeats × 3 rounds; min reported.

| Path | Baseline (μs/call) | Optimized (μs/call) | Δ | Throughput (calls/s) |
|---|---|---|---|---|
| model_response_creator() (no chunk) | 19.21 | 17.42 | −9.3% | 52,069 → 57,407 (+10.3%) |
| model_response_creator(chunk={"text": …}) | 19.28 | 17.69 | −8.3% | 51,859 → 56,531 (+9.0%) |
| model_response_creator(chunk={"id", "object", "created"}) | 18.16 | 16.19 | −10.9% | 55,067 → 61,777 (+12.2%) |

uv run python scripts/benchmark_model_response_creator.py --label optimized --iterations 200000 --warmup 2 --repeats 6

End-to-end stream-iteration benchmark

Full CustomStreamWrapper drained to completion across all 3 providers and both iteration modes. 30 streams × 500 chunks per stream × 4 repeats × 3 rounds; min reported.

| Provider | Mode | Baseline (μs/chunk) | Optimized (μs/chunk) | Δ | Throughput (chunks/s) |
|---|---|---|---|---|---|
| Anthropic | sync | 71.17 | 63.51 | −10.8% | 14,051 → 15,747 (+12.1%) |
| Anthropic | async | 45.31 | 43.37 | −4.3% | 22,072 → 23,060 (+4.5%) |
| Bedrock Invoke | sync | 67.80 | 60.82 | −10.3% | 14,750 → 16,442 (+11.5%) |
| Bedrock Invoke | async | 46.09 | 44.10 | −4.3% | 21,698 → 22,677 (+4.5%) |
| Bedrock Converse | sync | 137.72 | 127.94 | −7.1% | 7,261 → 7,816 (+7.6%) |
| Bedrock Converse | async | 85.35 | 84.73 | −0.7% | 11,716 → 11,802 (+0.7%) |

uv run python scripts/benchmark_streaming_chunk_overhead.py --label optimized --streams 30 --chunks 500 --warmup 2 --repeats 4

The sync paths gain the most (7–11%) because the model_dump() bug fix (Optimization 1) only applied to sync __next__. Async paths gain ~4% on Anthropic and Bedrock Invoke, and 0.7% on Bedrock Converse, from the other four optimizations.
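
For readers checking the tables, the Δ and throughput columns follow from the per-chunk latencies. The helpers below are a hypothetical sketch of that arithmetic; because the reported latencies are rounded to two decimals, recomputed throughput can differ from the table by a few chunks/s.

```python
def chunks_per_second(us_per_chunk: float) -> float:
    """Convert a per-chunk latency in microseconds to throughput."""
    return 1_000_000 / us_per_chunk


def percent_change(baseline: float, optimized: float) -> float:
    """Relative change from baseline to optimized, in percent."""
    return (optimized - baseline) / baseline * 100


# Anthropic sync row: 71.17 -> 63.51 microseconds per chunk.
anthropic_sync_baseline_throughput = round(chunks_per_second(71.17))
anthropic_sync_delta = round(percent_change(71.17, 63.51), 1)
```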

Behavioral test matrix

Scenario Anthropic Bedrock Invoke Bedrock Converse
Text-only chunks pass through
Usage chunk stripped from chunk body
Usage preserved in _hidden_params
finish_reason propagated
Sync path matches async path
model_dump() NOT called on text chunks
_GCHUNK_FIELDS module constant
Callback list resolved once per stream
Per-chunk overhead regression guard (200 chunks < 2 s) ✓ sync ✓ async

All 19 new tests pass; all pre-existing streaming handler tests continue to pass.

Test plan

  • uv run pytest tests/test_litellm/litellm_core_utils/test_streaming_overhead.py -v → 19 passed
  • uv run pytest tests/test_litellm/litellm_core_utils/test_streaming_handler.py -v -k "not gemini_legacy" → 57 passed
  • uv run ruff check litellm/litellm_core_utils/streaming_handler.py → no errors
  • uv run black litellm/litellm_core_utils/streaming_handler.py → formatted

https://claude.ai/code/session_019Jz1qomf9ta2mXrTuLAmPa


Generated by Claude Code

CLAassistant commented May 23, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

codspeed-hq Bot commented May 23, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing litellm_fix/LIT-3313-streaming-chunk-overhead (e8bf753) with main (35f6961)

@github-advanced-security github-advanced-security AI left a comment

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

codecov Bot commented May 23, 2026

Codecov Report

❌ Patch coverage is 96.29630% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| litellm/litellm_core_utils/streaming_handler.py | 96.29% | 1 Missing ⚠️ |

@yassin-berriai yassin-berriai changed the base branch from main to litellm_internal_staging May 25, 2026 16:35
@yassin-berriai

@greptileai

greptile-apps Bot commented May 25, 2026

Greptile Summary

This PR delivers five targeted per-chunk optimizations to CustomStreamWrapper, cutting sync-path overhead by ~10% and async-path overhead by ~4% across Anthropic, Bedrock Invoke, and Bedrock Converse.

  • Sync __next__ bug fix: hasattr(response, "usage") always returns True on ModelResponseStream (the field exists with default None), so model_dump() + model_response_creator() ran on every text chunk; replaced with getattr(response, "usage", None) is not None to match the already-correct async path.
  • Init-time caching: effective model name, logging provider, _base_hidden_params (pre-merged _hidden_params + response_cost: None), and the filtered callback list are all computed once per stream rather than per chunk; the self._hidden_params invariant (set entirely at __init__, never mutated during streaming) is correctly documented and verified.
  • Module-level _GCHUNK_FIELDS: GChunk.__annotations__ is read once at import time instead of on every generic_chunk_has_all_required_fields() call.

Confidence Score: 5/5

Safe to merge — all optimizations are semantically equivalent to the original hot path, the critical sync-path model_dump() regression is correctly fixed, and 19 mock-only tests validate the behavioral contract.

The sync __next__ fix is clearly correct (ModelResponseStream.usage defaults to None, so the getattr guard is strictly better than hasattr). The caching optimizations rely on an invariant (self._hidden_params not mutated after __init__) confirmed by a full grep: only two assignment sites exist, both in __init__. The _base_hidden_params spread order is functionally identical to the original since _base_hidden_params does not contain custom_llm_provider or created_at. No behavioral regressions are introduced.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| litellm/litellm_core_utils/streaming_handler.py | Five targeted hot-path optimizations: fix sync __next__ usage guard (most impactful), cache model name/provider at init, pre-compute _base_hidden_params, lazy-cache streaming callbacks, and module-level _GCHUNK_FIELDS. All changes are correct; invariants hold. |
| tests/test_litellm/litellm_core_utils/test_streaming_overhead.py | 19 new unit tests covering all 5 optimizations across 3 providers with mock-only setup; no real network calls. Tests validate both the correctness fixes and the caching behavior. |
| scripts/benchmark_model_response_creator.py | Reproducible microbenchmark for model_response_creator(); uses synthetic in-memory data and MagicMock, no real API calls. |
| scripts/benchmark_streaming_chunk_overhead.py | End-to-end stream-iteration benchmark driving CustomStreamWrapper with synthetic chunks; measures per-chunk latency across Anthropic, Bedrock Invoke, and Bedrock Converse for both sync and async paths. |

Reviews (3): Last reviewed commit: "fix: improve bedrock streaming hot path ..." | Re-trigger Greptile

@greptileai Fixed the hidden_params spread order in model_response_creator: **hidden_params now appears first (lowest priority) so the computed created_at, custom_llm_provider, and _base_hidden_params values always win. All 19 tests pass.


Generated by Claude Code

greptile-apps Bot commented May 25, 2026

The spread order fix looks correct — **hidden_params first (lowest priority), then custom_llm_provider/created_at, then **self._base_hidden_params last — and it faithfully preserves the original semantics.

One thing worth a quick sanity-check: _base_hidden_params is a static snapshot of self._hidden_params taken at __init__. In the original code, self._hidden_params was spread on every model_response_creator call, so any post-init mutations (e.g. a middleware appending model_id or api_base) would have been visible in each chunk. With the pre-computation those later mutations would be silently dropped.

From reviewing the diff, self._hidden_params looks like it's populated at init from litellm_params and not mutated during the stream, so this should be fine in practice. But if that assumption ever changes, _base_hidden_params would go stale. Worth a comment noting the invariant, e.g.:

# NOTE: _base_hidden_params is a snapshot of self._hidden_params taken at
# __init__.  It is only safe to pre-compute because self._hidden_params is
# never mutated after construction.  If that invariant changes, this cache
# must be invalidated or removed.
self._base_hidden_params: Dict[str, Any] = {
    **self._hidden_params,
    "response_cost": None,
}

Otherwise the fix is clean and all the semantics line up.
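
Both points in this comment, spread priority and snapshot staleness, can be seen in a small standalone sketch. The dictionary contents and `build_chunk_hidden_params` are hypothetical and only mirror the ordering described above; they are not the code in streaming_handler.py.

```python
from typing import Any, Dict

# Stand-in for self._hidden_params as populated at construction time.
hidden_params: Dict[str, Any] = {"model_id": "abc"}

# Snapshot taken once, analogous to _base_hidden_params at __init__.
base_hidden_params: Dict[str, Any] = {**hidden_params, "response_cost": None}


def build_chunk_hidden_params(extra: Dict[str, Any]) -> Dict[str, Any]:
    # In a dict literal, later entries win: caller-supplied values come first
    # (lowest priority), computed values next, and the snapshot last.
    return {
        **extra,
        "custom_llm_provider": "bedrock",  # placeholder value
        "created_at": 1234.5,  # placeholder value
        **base_hidden_params,
    }


result = build_chunk_hidden_params(
    {"custom_llm_provider": "from-caller", "response_cost": 0.5}
)

# A mutation after the snapshot is not reflected in later chunks.
hidden_params["api_base"] = "https://example.invalid"
stale_result = build_chunk_hidden_params({})
```

The caller-supplied `custom_llm_provider` and `response_cost` are overridden, and `api_base` never appears in `stale_result`, which is the staleness risk the invariant comment guards against.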

@greptileai Added the invariant comment on _base_hidden_params documenting that it's only safe to pre-compute because self._hidden_params is never mutated after __init__.


Generated by Claude Code

@yassin-berriai yassin-berriai force-pushed the litellm_fix/LIT-3313-streaming-chunk-overhead branch from ef9f75d to 3104e3f Compare May 28, 2026 16:41
@yassin-berriai yassin-berriai marked this pull request as ready for review May 28, 2026 16:55
@yassin-berriai yassin-berriai enabled auto-merge (squash) May 28, 2026 16:56
@yassin-berriai yassin-berriai force-pushed the litellm_fix/LIT-3313-streaming-chunk-overhead branch from 3104e3f to bfb7ea1 Compare May 28, 2026 16:56
@yassin-berriai yassin-berriai merged commit d5d6b26 into litellm_internal_staging May 28, 2026
104 of 118 checks passed
shudonglin added a commit to rayward-external/litellm that referenced this pull request Jun 1, 2026
* feat: add support for claude code goal mode for bedrock opus output config (BerriAI#28898)

* feat: support goal mode for claude on bedrock

* fix failing lint test

* addressing greptile comments

* fixing failed test

* address greptile: copy output_config and warn on dropped converse format

* fix(bedrock): skip redundant output_config normalization on Converse reasoning_effort path

When reasoning_effort is mapped via _handle_reasoning_effort_parameter, the
resulting output_config is already normalized via
normalize_bedrock_opus_output_config_effort. Mark it as normalized so
_prepare_request_params can skip the redundant call (and the associated
get_model_info lookup) on every request.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(reasoning-effort-grid): reflect Bedrock opus-4-6 xhigh→max clamping

* fix(bedrock): stop leaking output_config marker and message-content mutation

* fix(bedrock): guard effort key access in normalize_bedrock_opus_output_config_effort

Defensively check that 'effort' is a valid key in _BEDROCK_OUTPUT_CONFIG_EFFORT_ORDER
before indexing, to prevent a KeyError if the hardcoded guard tuple ever drifts from
the order dict's keys.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(bedrock): drop dead second clause in effort normalization guard

The 'effort not in _BEDROCK_OUTPUT_CONFIG_EFFORT_ORDER' check is
unreachable once 'effort not in ("xhigh", "max")' has been ruled out,
since both literals are present in the order dict. Keep the literal
membership check and let the dict lookups below speak for themselves.

* fix(bedrock): clamp output_config.effort against ceiling for any known value

The early return when effort was not 'xhigh'/'max' meant a ceiling of
'low' or 'medium' would silently forward an out-of-range value. Gate on
the known effort ordering instead so the ceiling comparison runs for
every recognized effort.

* test(grid_spec): use _CAPS_OPUS_4_7 for non-Bedrock opus-4-6 entries

claude-opus-4-6 now declares supports_xhigh_reasoning_effort in the model
map, so production accepts xhigh on Azure AI and Vertex AI routes. Update
those grid_spec entries to match production capabilities so expected()
predicts 200 for xhigh instead of 400.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(grid_spec): revert xhigh caps for non-Bedrock opus-4-6

azure_ai/claude-opus-4-6 and vertex_ai/claude-opus-4-6 do not declare
supports_xhigh_reasoning_effort in model_prices_and_context_window.json.
Azure AI upstream rejects xhigh with HTTP 400 ("Supported levels: high,
low, max, medium"). Restore _CAPS_4_6 so the grid predicts 400 for
xhigh, matching production capabilities.

* fix: stop advertising xhigh effort on Opus 4.5/4.6

Only Opus 4.7 supports the xhigh reasoning effort level. Remove the
supports_xhigh_reasoning_effort flag from every Opus 4.5 and Opus 4.6
entry (direct Anthropic, Bedrock, and regional variants) in both model
catalog files.

On the direct Anthropic path there is no effort clamp, so flagging 4.5/4.6
as xhigh-capable caused litellm to forward xhigh to a model that rejects it
(and made get_model_info misreport the capability). xhigh now correctly
degrades to high / raises on those models.

Bedrock graceful degradation for Claude Code goal mode is unaffected: it
relies solely on the bedrock_output_config_effort_ceiling clamp (4.5->high,
4.6->max, 4.7->xhigh), which runs before validation, so xhigh requests to
older Bedrock Opus models are still silently lowered rather than rejected.

Update effort-gating tests to reflect that 4.5/4.6 no longer accept xhigh.

* fix: clamp xhigh effort on Bedrock Invoke /v1/messages instead of rejecting

Claude Code "goal mode" sends output_config.effort=xhigh over the Anthropic
/v1/messages API, which routes Bedrock models through
AmazonAnthropicClaudeMessagesConfig. That path validated effort against the
model's native capability and raised 400 for xhigh on Opus 4.6, while the
chat-completions paths (Converse + Invoke) already clamp xhigh to the model's
bedrock_output_config_effort_ceiling. That asymmetry broke goal mode on the
exact API surface Claude Code uses.

Apply the same ceiling clamp on the messages path before the shared effort
gate runs, so xhigh degrades to max on Opus 4.6 (and stays xhigh on 4.7).
Scoped to adaptive-thinking models and to models that declare a ceiling, so
Sonnet 4.6 (no ceiling) and Opus 4.5 (budget mode) are unaffected and still
reject xhigh.

* fix(bedrock): preserve user output_config when applying reasoning_effort

- Converse path: merge mapped effort into existing output_config via
  setdefault instead of overwriting it, matching the Anthropic Messages
  path. Prevents user-supplied output_config.format from being silently
  dropped when reasoning_effort is also provided.
- tests: clear _get_local_model_cost_map lru_cache in the autouse
  fixture alongside get_bedrock_response_stream_shape to avoid stale
  cache leakage between tests.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(bedrock): pre-clamp reasoning_effort for chat invoke; correct test caps

- Add _clamp_adaptive_reasoning_effort_for_bedrock to AmazonAnthropicClaudeConfig
  so raw reasoning_effort=xhigh degrades to the model's bedrock effort ceiling
  before AnthropicConfig.map_openai_params converts it to output_config.
  Mirrors converse path (_handle_reasoning_effort_parameter) and messages path
  (_clamp_adaptive_reasoning_effort_for_bedrock) so the three Bedrock paths
  are consistent.

- grid_spec: restore caps=_CAPS_4_6 for Bedrock converse/invoke Opus 4.6 entries
  so the test reflects the model's actual JSON capabilities. Teach expected()
  to bypass the xhigh/max cap check when bedrock_effort_ceiling will clamp
  the wire effort, so the test still passes for Bedrock's graceful degradation
  contract without lying about native model caps.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: Dennis Henry <dennis.henry@okta.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>

* feat(guardrails): wire apply_guardrail into proxy logging callbacks (BerriAI#28970)

* feat(guardrails): wire apply_guardrail into proxy logging callbacks

Route /apply_guardrail through pre/post proxy hooks and LiteLLM success/failure handlers so Langfuse and OTEL integrations receive input/output on guardrail-only requests.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(guardrails): fix Greptile review comments on apply_guardrail logging

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(apply_guardrail): preserve original exception and capture modified response

- Capture return value from post_call_success_hook so callback-modified
  responses propagate to the caller.
- Wrap success/failure logging calls in defensive try/except so logging
  infrastructure failures don't replace the user-visible response or mask
  the original guardrail exception.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* Fix mypy

* fix(apply_guardrail): isolate failure logging and use post-hook response for logging

- Split async_failure_handler and post_call_failure_hook into independent
  try/except blocks so a callback bug in one does not silently skip the
  other.
- Build response_for_logging inside _emit_guardrail_success_logs after
  post_call_success_hook runs, so logged data matches the response the
  caller actually receives when the hook modifies the response.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(apply_guardrail): fix black formatting and update tests for fastapi_request param

- Run black on guardrail_endpoints.py to fix CI formatting check
- Add _mock_proxy_logging() helper to enterprise guardrail tests to patch
  proxy-server globals imported at call time
- Pass fastapi_request=Mock() in all direct apply_guardrail test calls
  to match updated function signature

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(guardrails): use transformed exception from post_call_failure_hook in apply_guardrail

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(guardrails): isolate sync/async logging handlers in apply_guardrail

Separate each logging handler call into its own try/except so a failure
in the async handler does not silently skip the sync handler submission
(and vice versa). Matches the docstring's defensive intent.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(apply_guardrail): guard transformed_exception with isinstance check

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(guardrails): mock proxy globals in not_found test and share apply_guardrail logging fixture

- Add proxy-server global mocks to test_apply_guardrail_not_found so the
  failure-path post_call_failure_hook call doesn't touch the real proxy
  logging singleton.
- Extract the duplicated _mock_proxy_logging context manager out of the
  two enterprise apply_guardrail test files into a shared conftest fixture
  so the helper stays in one place.

* fix(guardrails): use update_messages to keep logging obj in sync

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com>

* chore(ci): merge dev brach (BerriAI#29192)

* build(deps): bump next from 16.2.4 to 16.2.6 in /ui/litellm-dashboard (BerriAI#27665)

Bumps [next](https://github.com/vercel/next.js) from 16.2.4 to 16.2.6.
- [Release notes](https://github.com/vercel/next.js/releases)
- [Changelog](https://github.com/vercel/next.js/blob/canary/release.js)
- [Commits](vercel/next.js@v16.2.4...v16.2.6)

---
updated-dependencies:
- dependency-name: next
  dependency-version: 16.2.6
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump protobufjs in /tests/pass_through_tests (BerriAI#28296)

Bumps [protobufjs](https://github.com/protobufjs/protobuf.js) from 7.5.6 to 7.6.0.
- [Release notes](https://github.com/protobufjs/protobuf.js/releases)
- [Changelog](https://github.com/protobufjs/protobuf.js/blob/protobufjs-v7.6.0/CHANGELOG.md)
- [Commits](protobufjs/protobuf.js@protobufjs-v7.5.6...protobufjs-v7.6.0)

---
updated-dependencies:
- dependency-name: protobufjs
  dependency-version: 7.6.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump ws from 8.20.0 to 8.20.1 in /tests/pass_through_tests (BerriAI#28303)

Bumps [ws](https://github.com/websockets/ws) from 8.20.0 to 8.20.1.
- [Release notes](https://github.com/websockets/ws/releases)
- [Commits](websockets/ws@8.20.0...8.20.1)

---
updated-dependencies:
- dependency-name: ws
  dependency-version: 8.20.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: improve bedrock streaming hot path perf (BerriAI#28720)

* fix(proxy): enforce tag budgets for key-level tags (BerriAI#29108)

* fix(proxy): enforce tag budgets for key-level tags

Merge API key metadata.tags into request_data before _tag_max_budget_check
so per-tag budgets apply when tags are set on the key at creation time.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(auth): avoid false reject for key-inherited tags

Run reject_clientside_metadata_tags before key-tag injection, then inject key metadata tags immediately before tag budget checks so key tags still enforce budgets without being treated as client-supplied tags.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(vertex-ai): use DB credentials in video handlers + implement Veo video edit (BerriAI#29098)

* fix(vertex-ai): pass litellm_params to validate_environment in video handlers and implement video edit for Veo

- Pass litellm_params to validate_environment in 11 video handler call sites
  (remix, create_character, get_character, edit, extension, delete) so
  DB-stored Vertex AI credentials are used instead of falling back to ADC
- Implement transform_video_edit_request/response for VertexAI: fetches
  source video via fetchPredictOperation then submits a new
  predictLongRunning request with the video bytes/gcsUri + edit prompt

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(vertex-ai): hoist fetchPredictOperation into handlers to avoid blocking event loop

- Add get_video_edit_prefetch_params() to BaseVideoConfig (returns None)
- VertexAI overrides it to return the fetchPredictOperation URL/body
- Both sync and async video_edit handlers call this and use their shared
  httpx client for the fetch, passing the result as prefetched_source_data
- transform_video_edit_request is now a pure transform with no HTTP calls
- Fix extra_body.pop() mutation by working on a shallow copy

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(vertex-ai): include prefetch call inside _handle_error try/except block

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(videos): add prefetched_source_data param to all transform_video_edit_request overrides

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(video_edit): keep transform/pre_call outside try so validation errors propagate

Move transform_video_edit_request and logging_obj.pre_call outside the
try/except that wraps HTTP calls in (async_)video_edit_handler so that
ValueError validation errors (e.g. 'source video not complete yet') are
not silently wrapped as 500s by _handle_error. The prefetch HTTP call
keeps its own try/except so its errors are still mapped through the
provider's error handler. Matches the pattern used by
video_extension_handler and video_remix_handler.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* refactor(vertex_ai): delegate get_video_edit_prefetch_params to status retrieve

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* Fix varia review

* fix(video_edit): route transform errors through _handle_error

Wrap transform_video_edit_request and pre_call in the same try/except
as the HTTP call in sync and async handlers so validation failures
(e.g. source video not complete) return typed LiteLLM exceptions.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(datadog): drain cost-management queue + opt-in FinOps tag allowlist (BerriAI#28487)

* fix(datadog): drain cost-management queue + opt-in FinOps tag allowlist

* fix(datadog): guard non-dict callback_specific_params + log empty aggregation

* fix(datadog): block user-controlled tags from overwriting reserved cost-attribution dimensions

* fix(datadog): cast metadata to dict[str, Any] to satisfy mypy

* feat(helm): split per-component ServiceAccounts for gateway, backend, and UI (BerriAI#28712)

* feat(helm): split per-component ServiceAccounts for gateway, backend, and UI

Replace the single shared serviceAccount with three separate serviceAccounts
(gateway, backend, ui) so operators can attach different IRSA / Workload
Identity annotations per component without granting data-plane credentials
to the UI pod.

Key changes:
- values.yaml: rename serviceAccount → serviceAccounts with gateway/backend/ui
  sub-keys; UI defaults to automount: false
- _helpers.tpl: replace litellm.serviceAccountName with three component-scoped
  helpers (litellm.gateway/backend/ui.serviceAccountName)
- serviceaccount.yaml: create up to three separate ServiceAccount objects with
  component labels and per-SA automountServiceAccountToken
- gateway/backend deployments: use their respective SA helpers
- ui deployment: use litellm.ui.serviceAccountName + explicit
  automountServiceAccountToken: false on the pod spec so the projected token
  is absent even when the SA itself allows it
- migrations-job: share the backend SA (both need DB write access)

Resolves LIT-3171

https://claude.ai/code/session_01QPy362WnjmEpeNuJaPUqmF

* fix(helm): enforce automountServiceAccountToken on all pod specs; fix leading --- in serviceaccount.yaml

- gateway/backend deployments: add explicit automountServiceAccountToken on
  the pod spec so serviceAccounts.*.automount is honoured regardless of
  whether the SA is chart-created or operator-supplied (previously the flag
  only took effect on the SA object when create: true, creating an asymmetry
  with the UI which already enforced it at pod-spec level)
- serviceaccount.yaml: use a $prev sentinel to emit --- only between
  documents, preventing a leading --- when gateway SA is skipped but
  backend or ui SA is created (avoids lint/GitOps warnings from strict
  YAML parsers and tools like ArgoCD)

https://claude.ai/code/session_01QPy362WnjmEpeNuJaPUqmF

---------

Co-authored-by: Claude <noreply@anthropic.com>

* bump deps (BerriAI#29208) (BerriAI#29226)

* fix(deps): bump vulnerable proxy dependencies (starlette/fastapi, granian, pyarrow, semantic-router)

Resolve known CVEs flagged by osv-scanner/grype against uv.lock. All bumped
versions verified to resolve, install, and pass the proxy auth/route/middleware
unit suites (717 tests) plus an import smoke on the new stack.

- starlette 0.50.0 -> 1.1.0 (CVE-2026-48710 "BadHost", GHSA-86qp-5c8j-p5mr):
  versions <1.0.1 reconstruct request.url from the unvalidated Host header,
  poisoning request.url.path. Required raising fastapi 0.124.4 -> 0.136.3,
  which dropped fastapi's starlette<0.51.0 cap; an explicit starlette>=1.0.1
  floor blocks regression to a vulnerable transitive resolution. The proxy's
  own auth already reads scope["path"] via get_request_route, but the locked
  starlette still flagged in container scanners and left other request.url
  consumers exposed.
- granian 2.5.7 -> 2.7.4 (CVE-2026-42544, unauthenticated DoS via WebSocket
  subprotocol header panic; CVE-2026-42545, WSGI response-header-panic DoS).
  granian is a selectable proxy server (proxy_cli).
- pyarrow 22.0.0 -> 23.0.1 (CVE-2026-25087 / PYSEC-2026-113).
- semantic-router 0.1.12 -> 0.1.15: 0.1.12 was yanked (CVE-2026-42208 — its
  unbounded litellm pin could resolve a credential-exfiltrating litellm==1.82.8
  wheel).

Not fixable by bump: diskcache 5.6.3 (CVE-2025-69872, unsafe pickle
deserialization) has no upstream fix and is left pinned; exploiting it requires
write access to the local cache directory.

Relock side effect: sse-starlette 3.4.2 -> 3.4.4.

* deps: relax exact pins in optional extras to compatible ranges

The proxy/optional extras exact-pinned every dependency, which (1) forces
downstream `pip install litellm[proxy]` consumers into version lockstep and
(2) blocks them from pulling transitive security patches without forking — the
structural cause behind needing a litellm release to clear the starlette CVE in
the previous commit.

Convert the ordinary extras deps to `>=current,<next_major` ranges, mirroring
the core [project].dependencies style. Reproducibility for litellm's own
Docker/CI is unaffected: images install via `uv sync --frozen`, and the lock
re-resolves to the identical versions (no locked version changed).

Kept exact-pinned:
- litellm-proxy-extras, litellm-enterprise — litellm's own sub-packages,
  versioned in lockstep with the release.
- opentelemetry-api/sdk/exporter-otlp — must resolve to matching versions.
- grpcio — supply-chain-pinned to a vetted, aged release.

Also corrects the stale comment claiming the extras are exact-pinned for Docker
reproducibility (the images use the lock, not these pins).
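
As a rough illustration of the `>=current,<next_major` rule described above, the sketch below rewrites a single exact pin into a range. The helper name `relax_exact_pin` is hypothetical and not part of the repository; whether the actual conversion was scripted or done by hand is not stated in this commit.

```python
import re

# Matches a plain "name==X.Y.Z" pin made of numeric components only.
_EXACT_PIN = re.compile(r"\s*([A-Za-z0-9._-]+)==(\d+)((?:\.\d+)*)\s*")


def relax_exact_pin(requirement: str) -> str:
    """Rewrite 'name==X.Y.Z' as 'name>=X.Y.Z,<(X+1).0'.

    Anything that is not a plain numeric exact pin is returned unchanged
    (apart from surrounding whitespace), so already-ranged deps pass through.
    """
    match = _EXACT_PIN.fullmatch(requirement)
    if match is None:
        return requirement.strip()
    name, major, rest = match.groups()
    return f"{name}>={major}{rest},<{int(major) + 1}.0"
```

Packages that must stay in lockstep (the sub-packages, the opentelemetry trio, grpcio) would simply not be passed through such a helper.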

* fix(ci): resolve license-check lookup version from the floor for ranged deps

check_licenses.py derived the PyPI lookup version with
`next(iter(req.specifier))`, which returns an arbitrary specifier clause. For
a range like `>=0.12.1,<1.0` it picked the upper bound (`1.0`) — a version
that doesn't exist on PyPI — so the license lookup 404'd and the package was
flagged as having an unknown license.

The previous commit's switch from exact pins to ranges exposed this for
soundfile, pyroscope-io, redisvl, diskcache, and mlflow (the ranged deps not
already in liccheck.ini's allowlist). Prefer a lower-bound/exact version (a
real released version) for the lookup.
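
A minimal sketch of the "prefer a lower-bound/exact version" idea, assuming the specifier is available as a string such as `>=0.12.1,<1.0`. It parses clauses by hand with the standard library and does not reproduce the actual code in check_licenses.py; `lookup_version` is a hypothetical name.

```python
import re
from typing import Optional

# Longer operators come first so ">=" is not read as ">" followed by "=".
_CLAUSE = re.compile(r"\s*(===|==|~=|>=|<=|!=|>|<)\s*([^\s,]+)\s*")


def lookup_version(specifier: str) -> Optional[str]:
    """Pick a version that should exist on the index: an exact pin if present,
    otherwise a lower bound. Upper bounds and exclusions are never returned."""
    parsed = []
    for part in specifier.split(","):
        match = _CLAUSE.fullmatch(part)
        if match is not None:
            parsed.append((match.group(1), match.group(2)))
    for preferred in ("==", "===", "~=", ">="):
        for op, version in parsed:
            # Skip wildcard pins like "==1.*", which name no single release.
            if op == preferred and "*" not in version:
                return version
    return None
```

Unlike taking the first clause the iterator happens to yield, the result here does not depend on clause order, so `<1.0,>=0.12.1` and `>=0.12.1,<1.0` resolve to the same lookup version.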

* fix(proxy): set strict_content_type=False on the FastAPI app

Starlette 1.0 / FastAPI 0.13x flipped the default to strict_content_type=True,
which refuses to parse a JSON request body when the client omits the
Content-Type header. The proxy previously accepted those requests, so the
fastapi/starlette bump in this PR would silently break clients that don't send
a Content-Type. Restore the prior lenient behavior explicitly.

Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com>

* fix(tests/vcr): mint Google OAuth tokens live to prevent stale-token replay (BerriAI#29229)

The Redis-backed VCR layer was recording and replaying the Google
OAuth2/STS token-mint call. The replayed ya29.* access token is
long-expired, but its recorded expires_in keeps credentials.expired
False, so litellm never refreshes it and sends the stale token to a live
Vertex/Gemini endpoint, which returns 401 ACCESS_TOKEN_EXPIRED. This
broke live partner-model tests whose completion call is not itself
cassette-backed (e.g. test_vertex_ai_llama_tool_calling).

Force credential-exchange hosts to pass through live (never recorded,
never replayed) by returning None from before_record_request, mirroring
the existing telemetry passthrough, so a fresh token is minted each run.

Regression from BerriAI#28826, which added OAuth-token matcher tolerance plus
TTL-refresh-on-read, so a stale-token episode kept matching and never expired.
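
The passthrough described above can be sketched as a `before_record_request` hook that returns None for token-mint hosts. The host list and the `FakeRequest` stand-in are assumptions for illustration; the real hook and the exact hosts it matches live in the test suite and are not shown in this commit message.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit

# Assumed credential-exchange hosts; the actual list in the repo may differ.
LIVE_ONLY_HOSTS = frozenset({"oauth2.googleapis.com", "sts.googleapis.com"})


@dataclass
class FakeRequest:
    """Stand-in for a recorded-request object exposing a `uri` attribute."""
    uri: str


def before_record_request(request: FakeRequest) -> Optional[FakeRequest]:
    """Return None so token-mint calls are neither recorded nor replayed;
    every other request is handed back unchanged."""
    host = (urlsplit(request.uri).hostname or "").lower()
    if host in LIVE_ONLY_HOSTS:
        return None
    return request
```

Because the token call always goes to the live endpoint, the credential's expiry reflects a freshly minted token rather than a recorded `expires_in`.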

* chore(cookbook): bump Go directive to 1.26.3 in gollem example (BerriAI#29234)

Updates the gollem_go_agent_framework example to the current Go release.
Clears stale Go stdlib advisories reported by osv-scanner against the
older 1.25.1 directive. No source changes; the single pinned dependency
(gollem v0.1.0) is backward compatible.

* chore(ci): bump version (BerriAI#29242)

* bump: version 1.87.0 → 1.88.0

* uv lock

* feat(anthropic): add Claude Opus 4.8 and prune reasoning-effort flags (BerriAI#29238)

* feat(anthropic): add Claude Opus 4.8 and prune reasoning-effort flags

Register claude-opus-4-8 across the anthropic/bedrock/vertex/azure cost-map
entries, BEDROCK_CONVERSE_MODELS, and the setup-wizard provider list.

Prune two reasoning-effort fields from the cost map:
- Drop supports_minimal_reasoning_effort from the Claude fleet (58 entries).
  "minimal" is not a real Anthropic effort level (the API accepts only
  low/medium/high/xhigh/max), so LiteLLM degrades it to "low" regardless;
  the flag was inert and misleading on Anthropic.
- Remove tool_use_system_prompt_tokens everywhere (103 entries). It is not in
  the ModelInfo type and is read by no production code.

Update the affected config/schema tests; the reasoning-effort registry tests
now assert the Claude fleet omits supports_minimal.

* fix(anthropic): recognize output_config effort after minimal-flag prune

Pruning supports_minimal_reasoning_effort from the Claude fleet removed the
only "supports effort param" marker from 11 Opus 4.5 / mythos-preview map
entries that lack supports_output_config. _model_supports_effort_param then
returned False for them, so output_config was wrongly dropped under
drop_params=True -- regressing
test_anthropic_model_supports_effort_param_recognizes_supporting_models for
claude-opus-4-5-20251101 and the mythos preview.

- _model_supports_effort_param now treats supports_output_config as a
  sufficient signal, matching the bedrock-invoke call sites that already
  check supports_output_config OR a reasoning-effort flag. Shared map lookup
  extracted into _supports_model_capability.
- Add supports_output_config: true to the 11 Opus 4.5 / mythos entries that
  lost their only marker, restoring prior effort-forwarding behavior without
  re-adding the inert minimal flag.
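
A simplified sketch of the check after this fix, assuming the cost map is a plain dict of per-model flags. The function names drop the leading underscore to mark them as illustrative, and matching any `supports_*_reasoning_effort` key is an assumption about how the reasoning-effort flags are named.

```python
from typing import Any, Mapping

ModelMap = Mapping[str, Mapping[str, Any]]


def supports_model_capability(model_map: ModelMap, model: str, key: str) -> bool:
    """Shared lookup: True only when the map entry sets `key` to True."""
    return model_map.get(model, {}).get(key) is True


def model_supports_effort_param(model_map: ModelMap, model: str) -> bool:
    """supports_output_config alone is sufficient; otherwise fall back to
    any reasoning-effort flag set on the entry."""
    if supports_model_capability(model_map, model, "supports_output_config"):
        return True
    info = model_map.get(model, {})
    return any(
        value is True
        for key, value in info.items()
        if key.startswith("supports_") and key.endswith("_reasoning_effort")
    )
```

With this shape, an entry that carries only `supports_output_config: true` is recognized, so output_config is no longer dropped for it under drop_params=True.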

* fix(ci): restore real Bedrock batch S3 bucket and role in oai_misc_config (BerriAI#29245)

The OSS-staging sync (d52fbfb) overwrote the Bedrock batch model's
s3_bucket_name and aws_batch_role_arn with public-safe placeholders
(account 123456789012 / *_EXAMPLE role). The e2e_openai_endpoints CI job
runs the proxy with AWS account 941277531214 credentials, so on file
upload test_bedrock_batches_api failed with:

    NoSuchBucket: The specified bucket does not exist
    <BucketName>litellm-proxy-123456789012</BucketName>

Restore the real resources that live in account 941277531214 (verified
to exist) — the same values tests/batches_tests/test_bedrock_files_and_batches.py
already references.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* fix(guardrails): persist disable_global_guardrails on keys (BerriAI#29233)

* fix(guardrails): restore disable_global_guardrails persistence for keys

The per-key/team "Disable Global Guardrails" toggle silently stopped
working after BerriAI#17042, which removed `disable_global_guardrails` from the
key/team request models and from the premium metadata allowlist. Without
those, the UI's top-level field was dropped by pydantic and never folded
into key `metadata`, so the runtime gate always read False and global
default_on guardrails kept running.

Restore the request-model fields (KeyRequestBase, NewTeamRequest,
UpdateTeamRequest) and the `LiteLLM_ManagementEndpoint_MetadataFields_Premium`
entry so the flag is promoted into metadata again. Because the key edit
form always submits the flag (false by default), guard the UI so it is
only sent when it actually changed (edit) or is enabled (create) — this
keeps the premium gate on enabling intact while not 403-ing non-premium
users who edit unrelated key fields, mirroring how guardrails/tags are
already stripped.
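
The UI guard can be summarized by the following sketch. It is written in Python for consistency with the rest of this PR, although the dashboard itself is not Python, and `build_key_payload` is a hypothetical name.

```python
from typing import Any, Dict, Optional

FLAG = "disable_global_guardrails"


def build_key_payload(
    form_values: Dict[str, Any],
    existing_key: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Send the flag only when enabling it on create, or when it changed on edit."""
    payload = dict(form_values)
    submitted = bool(payload.pop(FLAG, False))
    if existing_key is None:
        include = submitted  # create: only when enabled
    else:
        include = submitted != bool(existing_key.get(FLAG, False))  # edit: only when changed
    if include:
        payload[FLAG] = submitted
    return payload
```

An edit that leaves the toggle untouched therefore never carries the premium field, while turning it off still sends an explicit `False`.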

* test(guardrails): cover disable_global_guardrails toggle-off + clarify premium field comment

Add a prepare_metadata_fields case asserting `disable_global_guardrails: False`
overwrites an existing `True`, and rewrite the PREMIUM_METADATA_FIELDS comment to
explain why boolean premium fields are excluded from the empty-value strip loop.
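
To show why boolean fields need to bypass the empty-value strip, here is a simplified sketch. `fold_fields_into_metadata` is hypothetical and only approximates what prepare_metadata_fields does; the point is that `False` is a meaningful value for a toggle and must overwrite a stored `True`.

```python
from typing import Any, Dict, Iterable

# Values treated as "not provided" for non-boolean fields (an assumption).
_EMPTY_VALUES = (None, "", [], {})


def fold_fields_into_metadata(
    request: Dict[str, Any],
    existing_metadata: Dict[str, Any],
    premium_fields: Iterable[str],
    boolean_fields: Iterable[str],
) -> Dict[str, Any]:
    """Copy submitted premium fields into metadata, skipping empty values
    except for boolean fields, where an explicit False must be persisted."""
    booleans = set(boolean_fields)
    metadata = dict(existing_metadata)
    for field in premium_fields:
        if field not in request:
            continue
        value = request[field]
        if field not in booleans and value in _EMPTY_VALUES:
            continue
        metadata[field] = value
    return metadata
```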

* test(e2e): cover Team Admin view + member + key flows (BerriAI#29072)

* test(e2e): cover Team Admin view + member + key flows

Adds a new spec exercising the previously-uncovered team-admin manual-QA
items: viewing all team keys (including other members'), adding a member,
removing a member, and creating a team key with All Team Models. Also
seeds a dedicated invitee user so the add-member test can run in parallel
with the proxy-admin invite test without colliding on the team roster.

* test(e2e): harden team-admin member specs per review feedback

Address Greptile feedback on the Team Admin spec:
- locate the delete action via getByTestId("delete-member") instead of
  the fragile svg/img .last() selector
- match the seeded removable member by user_id (members_with_roles stores
  no email, so the roster renders user_id)
- assert exact success-toast strings rather than broad regexes that could
  match unrelated "success" text

* docs: hand-written CLAUDE.md; point GEMINI.md and AGENTS.md at it (BerriAI#29252)

* docs: replace generated CLAUDE.md with hand-written guidance, remove AGENTS.md

Swap the auto-generated CLAUDE.md for a concise hand-written version that captures how we actually want agents to work in this repo: minimal comments, simplicity first, meaningful tests with a high mutation kill rate, PRs based off litellm_internal_staging rather than main, and curl against a live proxy as proof of fix instead of pasted pytest output. Remove AGENTS.md so there is one source of truth for agent guidance. The customer and company name confidentiality policy, along with the MCP available_on_public_internet note, are carried over from the previous CLAUDE.md.

* fix: further clarify communication guidelines

* docs: point GEMINI.md at CLAUDE.md instead of duplicating guidance

Replace the standalone GEMINI.md copy, which had already drifted from the new CLAUDE.md, with a one-line pointer so Gemini reads the same single source of truth.

* docs: simplify PR template test checklist item

Replace the rigid "at least 1 test is a hard requirement" checklist line with "I have added meaningful tests", which matches the testing guidance in CLAUDE.md, and tidy a comma into a semicolon in the scope-isolation item.

* docs: point AGENTS.md at CLAUDE.md instead of deleting it

Keep AGENTS.md so tools that read it still resolve guidance, but collapse it to the same one-line pointer to CLAUDE.md used by GEMINI.md, keeping a single source of truth.

* fix: make AI-generated rules more concise

* fix: spelling

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix: make the .env usage more careful

* docs: restore MCP available_on_public_internet note to CLAUDE.md

The PR description states this note was carried over verbatim from the
previous CLAUDE.md, but it was dropped in the rewrite. Restore it so the
file matches the description and the team guidance is not lost.

* docs: restore browser storage and CI supply-chain safety notes to CLAUDE.md

These security-relevant rules were dropped in the rewrite. Restore the
sessionStorage-over-localStorage (XSS) guidance and the CI supply-chain
rules (no curl|bash, pin versions, verify checksums) so agents editing UI
or CI code are still steered away from those pitfalls.

* docs: move area-specific guidance into nested CLAUDE.md files

The MCP, browser-storage, and CI supply-chain notes are scoped to
particular parts of the tree, so move each into a nested CLAUDE.md that
Claude Code loads on demand when those files are touched: the MCP note
under the mcp_server gateway, the browser-storage rule under the UI
dashboard, and the CI supply-chain rules under .circleci. Keeps the root
CLAUDE.md focused on general guidance while the area notes surface where
they are relevant.

* docs: keep CI supply-chain note in root CLAUDE.md

CI guidance applies beyond .circleci (it also covers downloads in GitHub
workflows and any CI script), and CI work does not reliably touch a single
subtree, so a nested file under .circleci would not surface it dependably.
Keep it in the always-loaded root instead. The MCP and browser-storage
notes stay nested where they map cleanly to one area of the tree.

* fix: make it clear we prefer httpOnly

* chore: make ci rule more concise

* chore: make concise

Fix formatting and punctuation in MCP note.

* fix: don't include Claude attribution

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix: regenerate uv.lock to sync with pyproject.toml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>
Co-authored-by: Dennis Henry <dennis.henry@okta.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Sameer Kankute <sameer@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: michelligabriele <gabriele.michelli@icloud.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>