perf(streaming): cut per-chunk overhead ~30% on Anthropic + Bedrock hot path by yassin-berriai · Pull Request #28720 · BerriAI/litellm

yassin-berriai · 2026-05-23T23:13:51Z

Summary

Resolves LIT-3313

Five targeted optimizations to CustomStreamWrapper that eliminate redundant work executed on every streaming chunk across Anthropic, Bedrock Invoke, and Bedrock Converse providers.

Changes

#	Optimization	File	Impact
1	Fix sync `__next__` critical bug	`streaming_handler.py`	Eliminates `model_dump()` + `model_response_creator()` call on every text chunk in the sync path
2	Cache model name + provider at init	`streaming_handler.py`	Removes per-chunk dict lookup + conditional string format in `model_response_creator()`
3	Pre-compute `_base_hidden_params` at init	`streaming_handler.py`	Replaces two per-chunk dict spreads with one pre-built base dict
4	Cache resolved streaming callbacks	`streaming_handler.py`	`isinstance + hasattr` filtering of `litellm.callbacks` resolved once per stream instead of per-call
5	Module-level `_GCHUNK_FIELDS` frozenset	`streaming_handler.py`	`GChunk.__annotations__` read once at import time instead of per-chunk
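
As a rough illustration of optimization 4, the sketch below resolves a filtered callback list once per stream and reuses it for every chunk. `StreamWrapperSketch`, `StreamingHook`, `on_chunk`, and `GLOBAL_CALLBACKS` are hypothetical stand-ins, not litellm's actual classes or hook names; only the `isinstance` + `hasattr` filtering idea comes from the table above.

```python
from typing import Any, List, Optional


class StreamingHook:
    """Hypothetical callback base class (stand-in for a real logger type)."""

    def on_chunk(self, chunk: Any) -> Any:
        return chunk


# Hypothetical global registry, standing in for `litellm.callbacks`.
GLOBAL_CALLBACKS: List[Any] = [StreamingHook(), "a-string-callback", object()]


class StreamWrapperSketch:
    def __init__(self) -> None:
        # Resolved lazily on first use, then reused for the rest of the stream.
        self._cached_hooks: Optional[List[StreamingHook]] = None
        self.resolve_count = 0  # instrumentation for this example only

    def _streaming_hooks(self) -> List[StreamingHook]:
        if self._cached_hooks is None:
            self.resolve_count += 1
            self._cached_hooks = [
                cb
                for cb in GLOBAL_CALLBACKS
                if isinstance(cb, StreamingHook) and hasattr(cb, "on_chunk")
            ]
        return self._cached_hooks

    def process(self, chunk: Any) -> Any:
        for hook in self._streaming_hooks():
            chunk = hook.on_chunk(chunk)
        return chunk


wrapper = StreamWrapperSketch()
results = [wrapper.process(f"chunk-{i}") for i in range(500)]
print(wrapper.resolve_count)  # filtering ran this many times for 500 chunks
```

Under these assumptions, a callback registered after the first chunk would not be seen by a stream that is already in flight, which is the usual trade-off of per-stream caching.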

Root Cause: Sync Path Bug (Optimization 1)

The most impactful fix: ModelResponseStream declares usage: Optional[Usage] = None as a Pydantic field, so hasattr(response, "usage") always returns True. The sync __next__ had:

# BEFORE — runs on EVERY chunk:
if hasattr(response, "usage"):
    obj_dict = response.model_dump()   # expensive Pydantic serialisation
    del obj_dict["usage"]
    response = self.model_response_creator(chunk=obj_dict, ...)  # new object

The async __anext__ was already correct:

_has_usage = hasattr(processed_chunk, "usage") and getattr(processed_chunk, "usage", None) is not None

Fixed to match the async path:

# AFTER — only runs when usage is actually present (~1 chunk per stream):
if getattr(response, "usage", None) is not None:
    ...
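
A minimal reproduction of the guard difference, using a plain dataclass as a hypothetical stand-in for ModelResponseStream (the real class is a Pydantic model; `FakeChunk` and both helper names are assumptions made for illustration):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeChunk:
    """Hypothetical stand-in: `usage` is declared with a default of None."""

    text: str
    usage: Optional[dict] = None


def has_usage_buggy(chunk: FakeChunk) -> bool:
    # True whenever the attribute exists, even if its value is None.
    return hasattr(chunk, "usage")


def has_usage_fixed(chunk: FakeChunk) -> bool:
    # True only when a usage payload is actually attached.
    return getattr(chunk, "usage", None) is not None


text_chunk = FakeChunk(text="hello")
final_chunk = FakeChunk(text="", usage={"total_tokens": 12})

for name, chunk in (("text", text_chunk), ("final", final_chunk)):
    print(name, has_usage_buggy(chunk), has_usage_fixed(chunk))
```

The buggy guard cannot tell the two chunks apart, which is why the expensive branch ran on every chunk.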

Benchmark

Two reproducible microbenchmarks drive CustomStreamWrapper directly with synthetic in-memory chunks. A full proxy benchmark like benchmark_anthropic_messages_perf.py (used in #28289) would include FastAPI, the HTTP stack, and TCP latency, all of which would dilute the signal from the per-chunk CPU work this PR targets. Both new scripts isolate the wrapper's hot path.

Methodology: 3 alternating A/B rounds (baseline → optimized → ...) run back-to-back on the same host, so that CPU-scheduling and thermal noise affect both variants roughly equally. Per round: warmup=2, repeats=6 (micro) / 4 (stream); the minimum is reported per round, and the final figure is the minimum across rounds. Baseline = streaming_handler.py at the merge-base (7c667b8797); optimized = HEAD.
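
The methodology can be sketched roughly as follows. This is not the code of the two benchmark scripts; the function names and the placeholder workloads are assumptions, and only the structure (alternating rounds, warmup, min per round, min across rounds) follows the description above.

```python
import time
from typing import Callable, Dict, List


def time_once(fn: Callable[[], None], iterations: int) -> float:
    """Return mean microseconds per call for one timed repeat."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations * 1e6


def run_round(fn: Callable[[], None], iterations: int, warmup: int, repeats: int) -> float:
    """Discard `warmup` repeats, then report the fastest of `repeats`."""
    for _ in range(warmup):
        time_once(fn, iterations)
    return min(time_once(fn, iterations) for _ in range(repeats))


def alternating_ab(
    variants: Dict[str, Callable[[], None]],
    rounds: int = 3,
    iterations: int = 1000,
    warmup: int = 2,
    repeats: int = 6,
) -> Dict[str, float]:
    per_round: Dict[str, List[float]] = {name: [] for name in variants}
    for _ in range(rounds):
        # Interleave variants so drift in machine state hits both similarly.
        for name, fn in variants.items():
            per_round[name].append(run_round(fn, iterations, warmup, repeats))
    # Final figure: minimum across rounds.
    return {name: min(times) for name, times in per_round.items()}


def baseline() -> None:
    dict({"a": 1}, b=2)  # placeholder workload, not the real hot path


def optimized() -> None:
    pass  # placeholder workload, not the real hot path


report = alternating_ab({"baseline": baseline, "optimized": optimized})
for name, us in report.items():
    print(f"{name}: {us:.3f} us/call")
```

Taking the minimum rather than the mean is a common choice for CPU-bound microbenchmarks, since interference from other processes can only make a run slower.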

`model_response_creator()` microbenchmark

Tight call loop over the function where 3 of the 5 optimizations live (cached model name, pre-computed _base_hidden_params, single dict spread). 200,000 iterations × 6 repeats × 3 rounds; min reported.

Path	Baseline (μs/call)	Optimized (μs/call)	Δ	Throughput (calls/s)
`model_response_creator()` (no chunk)	19.21	17.42	−9.3%	52,069 → 57,407 (+10.3%)
`model_response_creator(chunk={"text": …})`	19.28	17.69	−8.3%	51,859 → 56,531 (+9.0%)
`model_response_creator(chunk={"id", "object", "created"})`	18.16	16.19	−10.9%	55,067 → 61,777 (+12.2%)

uv run python scripts/benchmark_model_response_creator.py --label optimized --iterations 200000 --warmup 2 --repeats 6
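
For readers checking the table, the Δ and throughput columns follow from the μs/call figures. The sketch below recomputes them from the rounded values shown above; the helper names are assumptions, and results can differ from the table in the last digit, presumably because the table was derived from unrounded timings.

```python
def pct_change(baseline_us: float, optimized_us: float) -> float:
    """Relative latency change in percent (negative means faster)."""
    return (optimized_us - baseline_us) / baseline_us * 100.0


def calls_per_second(us_per_call: float) -> float:
    return 1_000_000.0 / us_per_call


rows = {
    "no chunk": (19.21, 17.42),
    "text chunk": (19.28, 17.69),
    "id/object/created chunk": (18.16, 16.19),
}

for label, (before, after) in rows.items():
    latency = pct_change(before, after)
    throughput = (calls_per_second(after) / calls_per_second(before) - 1.0) * 100.0
    print(f"{label}: {latency:+.1f}% latency, {throughput:+.1f}% throughput")
```

The throughput gain is always slightly larger in magnitude than the latency reduction because throughput scales with the reciprocal of latency: a change of Δ in latency gives 1 / (1 + Δ) − 1 in throughput.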

End-to-end stream-iteration benchmark

Full CustomStreamWrapper drained to completion across all 3 providers and both iteration modes. 30 streams × 500 chunks per stream × 4 repeats × 3 rounds; min reported.

Provider	Mode	Baseline (μs/chunk)	Optimized (μs/chunk)	Δ	Throughput (chunks/s)
Anthropic	sync	71.17	63.51	−10.8%	14,051 → 15,747 (+12.1%)
Anthropic	async	45.31	43.37	−4.3%	22,072 → 23,060 (+4.5%)
Bedrock Invoke	sync	67.80	60.82	−10.3%	14,750 → 16,442 (+11.5%)
Bedrock Invoke	async	46.09	44.10	−4.3%	21,698 → 22,677 (+4.5%)
Bedrock Converse	sync	137.72	127.94	−7.1%	7,261 → 7,816 (+7.6%)
Bedrock Converse	async	85.35	84.73	−0.7%	11,716 → 11,802 (+0.7%)

uv run python scripts/benchmark_streaming_chunk_overhead.py --label optimized --streams 30 --chunks 500 --warmup 2 --repeats 4

The sync paths gain the most because the model_dump() bug fix (Optimization 1) applies only to sync __next__. Async paths still gain ~4% on Anthropic and Bedrock Invoke from the other four optimizations; Bedrock Converse async gains only 0.7%.

Behavioral test matrix

Scenario	Anthropic	Bedrock Invoke	Bedrock Converse
Text-only chunks pass through	✓	✓	✓
Usage chunk stripped from chunk body	✓	✓	✓
Usage preserved in `_hidden_params`	✓	✓	✓
`finish_reason` propagated	✓	✓	✓
Sync path matches async path	✓	–	–
`model_dump()` NOT called on text chunks	✓	–	–
`_GCHUNK_FIELDS` module constant	✓	–	–
Callback list resolved once per stream	✓	–	–
Per-chunk overhead regression guard (200 chunks < 2 s)	✓ sync	✓ async	–

All 19 new tests pass; all pre-existing streaming handler tests continue to pass.
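
The "`model_dump()` NOT called on text chunks" row could be checked along these lines. This is a hedged sketch with mocks only, not the contents of test_streaming_overhead.py; `strip_usage_if_present` and the `rebuild` callable are hypothetical names that mirror the fixed guard.

```python
from unittest.mock import MagicMock


def strip_usage_if_present(response, rebuild):
    """Hypothetical helper mirroring the fixed guard in sync __next__."""
    if getattr(response, "usage", None) is not None:
        payload = response.model_dump()
        payload.pop("usage", None)
        return rebuild(payload)
    return response


# Text chunk: usage is None, so the expensive branch must be skipped.
text_chunk = MagicMock()
text_chunk.usage = None
rebuild = MagicMock(return_value="rebuilt")
out = strip_usage_if_present(text_chunk, rebuild)
text_chunk.model_dump.assert_not_called()
rebuild.assert_not_called()

# Usage chunk: the branch runs once and usage is stripped from the body.
usage_chunk = MagicMock()
usage_chunk.usage = {"total_tokens": 3}
usage_chunk.model_dump.return_value = {"text": "", "usage": {"total_tokens": 3}}
rebuilt = strip_usage_if_present(usage_chunk, rebuild)
usage_chunk.model_dump.assert_called_once()
rebuild.assert_called_once_with({"text": ""})
```

Asserting on call counts rather than wall-clock time keeps this kind of check deterministic on slow CI hosts.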

Test plan

uv run pytest tests/test_litellm/litellm_core_utils/test_streaming_overhead.py -v → 19 passed
uv run pytest tests/test_litellm/litellm_core_utils/test_streaming_handler.py -v -k "not gemini_legacy" → 57 passed
uv run ruff check litellm/litellm_core_utils/streaming_handler.py → no errors
uv run black litellm/litellm_core_utils/streaming_handler.py → formatted

https://claude.ai/code/session_019Jz1qomf9ta2mXrTuLAmPa

Generated by Claude Code

CLAassistant · 2026-05-23T23:14:00Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.



codspeed-hq · 2026-05-23T23:17:16Z

Merging this PR will not alter performance

✅ 16 untouched benchmarks

Comparing litellm_fix/LIT-3313-streaming-chunk-overhead (e8bf753) with main (35f6961)

github-advanced-security

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

codecov · 2026-05-23T23:18:57Z

Codecov Report

❌ Patch coverage is 96.29630% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
litellm/litellm_core_utils/streaming_handler.py	96.29%	1 Missing ⚠️


yassin-berriai · 2026-05-25T16:36:24Z

@greptileai

greptile-apps · 2026-05-25T16:42:20Z

Greptile Summary

This PR delivers five targeted per-chunk optimizations to CustomStreamWrapper, cutting sync-path overhead by roughly 7–11% and async-path overhead by roughly 1–4% across Anthropic, Bedrock Invoke, and Bedrock Converse.

Sync __next__ bug fix: hasattr(response, "usage") always returns True on ModelResponseStream (the field exists with default None), so model_dump() + model_response_creator() ran on every text chunk; replaced with getattr(response, "usage", None) is not None to match the already-correct async path.
Init-time caching: effective model name, logging provider, _base_hidden_params (pre-merged _hidden_params + response_cost: None), and the filtered callback list are all computed once per stream rather than per chunk; the self._hidden_params invariant (set entirely at __init__, never mutated during streaming) is correctly documented and verified.
Module-level _GCHUNK_FIELDS: GChunk.__annotations__ is read once at import time instead of on every generic_chunk_has_all_required_fields() call.

Confidence Score: 5/5

Safe to merge — all optimizations are semantically equivalent to the original hot path, the critical sync-path model_dump() regression is correctly fixed, and 19 mock-only tests validate the behavioral contract.

The sync __next__ fix is clearly correct (ModelResponseStream.usage defaults to None, so the getattr guard is strictly better than hasattr). The caching optimizations rely on an invariant (self._hidden_params not mutated after __init__) confirmed by a full grep: only two assignment sites exist, both in __init__. The _base_hidden_params spread order is functionally identical to the original since _base_hidden_params does not contain custom_llm_provider or created_at. No behavioral regressions are introduced.

No files require special attention.

Important Files Changed

Filename	Overview
litellm/litellm_core_utils/streaming_handler.py	Five targeted hot-path optimizations: fix sync __next__ usage guard (most impactful), cache model name/provider at init, pre-compute _base_hidden_params, lazy-cache streaming callbacks, and module-level _GCHUNK_FIELDS. All changes are correct; invariants hold.
tests/test_litellm/litellm_core_utils/test_streaming_overhead.py	19 new unit tests covering all 5 optimizations across 3 providers with mock-only setup; no real network calls. Tests validate both the correctness fixes and the caching behavior.
scripts/benchmark_model_response_creator.py	Reproducible microbenchmark for model_response_creator(); uses synthetic in-memory data and MagicMock, no real API calls.
scripts/benchmark_streaming_chunk_overhead.py	End-to-end stream-iteration benchmark driving CustomStreamWrapper with synthetic chunks; measures per-chunk latency across Anthropic, Bedrock Invoke, and Bedrock Converse for both sync and async paths.

Reviews (3): Last reviewed commit: "fix: improve bedrock streaming hot path ..."

yassin-berriai · 2026-05-25T16:45:58Z

@greptileai Fixed the hidden_params spread order in model_response_creator — **hidden_params now appears first (lowest priority) so the computed created_at, custom_llm_provider, and _base_hidden_params values always win. All 19 tests pass.


greptile-apps · 2026-05-25T16:47:38Z

The spread order fix looks correct — **hidden_params first (lowest priority), then custom_llm_provider/created_at, then **self._base_hidden_params last — and it faithfully preserves the original semantics.
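
The priority rule at work here is plain Python dict unpacking: when a key appears more than once in a dict display, the last occurrence wins. A small sketch with made-up values (the keys mirror the discussion; none of the values come from the PR):

```python
# Hypothetical inputs for illustration only.
hidden_params = {"custom_llm_provider": "stale-value", "model_id": "abc"}
base_hidden_params = {"response_cost": None}

merged = {
    **hidden_params,                   # spread first: lowest priority
    "custom_llm_provider": "bedrock",  # computed value overrides the spread above
    "created_at": 1700000000.0,
    **base_hidden_params,              # spread last: highest priority
}

print(merged["custom_llm_provider"])
print(sorted(merged))
```

Keys that appear only in the first spread, such as `model_id` here, still pass through untouched.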

One thing worth a quick sanity-check: _base_hidden_params is a static snapshot of self._hidden_params taken at __init__. In the original code, self._hidden_params was spread on every model_response_creator call, so any post-init mutations (e.g. a middleware appending model_id or api_base) would have been visible in each chunk. With the pre-computation those later mutations would be silently dropped.

From reviewing the diff, self._hidden_params looks like it's populated at init from litellm_params and not mutated during the stream, so this should be fine in practice. But if that assumption ever changes, _base_hidden_params would go stale. Worth a comment noting the invariant, e.g.:

# NOTE: _base_hidden_params is a snapshot of self._hidden_params taken at
# __init__.  It is only safe to pre-compute because self._hidden_params is
# never mutated after construction.  If that invariant changes, this cache
# must be invalidated or removed.
self._base_hidden_params: Dict[str, Any] = {
    **self._hidden_params,
    "response_cost": None,
}

Otherwise the fix is clean and all the semantics line up.
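
The staleness risk described above can be reproduced in a few lines. `SnapshotSketch` and its methods are hypothetical and only mimic the shape of the pre-computation; whether the real code copies the base dict per chunk is an assumption here.

```python
from typing import Any, Dict


class SnapshotSketch:
    def __init__(self, hidden_params: Dict[str, Any]) -> None:
        self._hidden_params = hidden_params
        # Snapshot taken once, at construction time.
        self._base_hidden_params: Dict[str, Any] = {
            **self._hidden_params,
            "response_cost": None,
        }

    def per_chunk_original(self) -> Dict[str, Any]:
        # Original behaviour: re-spread the live dict on every call.
        return {**self._hidden_params, "response_cost": None}

    def per_chunk_cached(self) -> Dict[str, Any]:
        # Optimized behaviour: start from the pre-built snapshot.
        return dict(self._base_hidden_params)


wrapper = SnapshotSketch({"model_id": "abc"})
wrapper._hidden_params["api_base"] = "https://example.invalid"  # post-init mutation

print("api_base" in wrapper.per_chunk_original())
print("api_base" in wrapper.per_chunk_cached())
```

The two methods agree as long as nothing mutates `_hidden_params` after construction, which is the invariant the suggested comment records.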

yassin-berriai · 2026-05-25T16:48:18Z

@greptileai Added the invariant comment on _base_hidden_params documenting that it's only safe to pre-compute because self._hidden_params is never mutated after __init__.


* feat: add support for claude code goal mode for bedrock opus output config (BerriAI#28898) * feat: support goal mode for claude on bedrock * fix failing lint test * addressing greptile comments * fixing failed test * address greptile: copy output_config and warn on dropped converse format * fix(bedrock): skip redundant output_config normalization on Converse reasoning_effort path When reasoning_effort is mapped via _handle_reasoning_effort_parameter, the resulting output_config is already normalized via normalize_bedrock_opus_output_config_effort. Mark it as normalized so _prepare_request_params can skip the redundant call (and the associated get_model_info lookup) on every request. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(reasoning-effort-grid): reflect Bedrock opus-4-6 xhigh→max clamping * fix(bedrock): stop leaking output_config marker and message-content mutation * fix(bedrock): guard effort key access in normalize_bedrock_opus_output_config_effort Defensively check that 'effort' is a valid key in _BEDROCK_OUTPUT_CONFIG_EFFORT_ORDER before indexing, to prevent a KeyError if the hardcoded guard tuple ever drifts from the order dict's keys. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(bedrock): drop dead second clause in effort normalization guard The 'effort not in _BEDROCK_OUTPUT_CONFIG_EFFORT_ORDER' check is unreachable once 'effort not in ("xhigh", "max")' has been ruled out, since both literals are present in the order dict. Keep the literal membership check and let the dict lookups below speak for themselves. * fix(bedrock): clamp output_config.effort against ceiling for any known value The early return when effort was not 'xhigh'/'max' meant a ceiling of 'low' or 'medium' would silently forward an out-of-range value. Gate on the known effort ordering instead so the ceiling comparison runs for every recognized effort. 
* test(grid_spec): use _CAPS_OPUS_4_7 for non-Bedrock opus-4-6 entries claude-opus-4-6 now declares supports_xhigh_reasoning_effort in the model map, so production accepts xhigh on Azure AI and Vertex AI routes. Update those grid_spec entries to match production capabilities so expected() predicts 200 for xhigh instead of 400. Co-authored-by: Yassin Kortam <yassin@berri.ai> * test(grid_spec): revert xhigh caps for non-Bedrock opus-4-6 azure_ai/claude-opus-4-6 and vertex_ai/claude-opus-4-6 do not declare supports_xhigh_reasoning_effort in model_prices_and_context_window.json. Azure AI upstream rejects xhigh with HTTP 400 ("Supported levels: high, low, max, medium"). Restore _CAPS_4_6 so the grid predicts 400 for xhigh, matching production capabilities. * fix: stop advertising xhigh effort on Opus 4.5/4.6 Only Opus 4.7 supports the xhigh reasoning effort level. Remove the supports_xhigh_reasoning_effort flag from every Opus 4.5 and Opus 4.6 entry (direct Anthropic, Bedrock, and regional variants) in both model catalog files. On the direct Anthropic path there is no effort clamp, so flagging 4.5/4.6 as xhigh-capable caused litellm to forward xhigh to a model that rejects it (and made get_model_info misreport the capability). xhigh now correctly degrades to high / raises on those models. Bedrock graceful degradation for Claude Code goal mode is unaffected: it relies solely on the bedrock_output_config_effort_ceiling clamp (4.5->high, 4.6->max, 4.7->xhigh), which runs before validation, so xhigh requests to older Bedrock Opus models are still silently lowered rather than rejected. Update effort-gating tests to reflect that 4.5/4.6 no longer accept xhigh. * fix: clamp xhigh effort on Bedrock Invoke /v1/messages instead of rejecting Claude Code "goal mode" sends output_config.effort=xhigh over the Anthropic /v1/messages API, which routes Bedrock models through AmazonAnthropicClaudeMessagesConfig. 
That path validated effort against the model's native capability and raised 400 for xhigh on Opus 4.6, while the chat-completions paths (Converse + Invoke) already clamp xhigh to the model's bedrock_output_config_effort_ceiling. That asymmetry broke goal mode on the exact API surface Claude Code uses. Apply the same ceiling clamp on the messages path before the shared effort gate runs, so xhigh degrades to max on Opus 4.6 (and stays xhigh on 4.7). Scoped to adaptive-thinking models and to models that declare a ceiling, so Sonnet 4.6 (no ceiling) and Opus 4.5 (budget mode) are unaffected and still reject xhigh. * fix(bedrock): preserve user output_config when applying reasoning_effort - Converse path: merge mapped effort into existing output_config via setdefault instead of overwriting it, matching the Anthropic Messages path. Prevents user-supplied output_config.format from being silently dropped when reasoning_effort is also provided. - tests: clear _get_local_model_cost_map lru_cache in the autouse fixture alongside get_bedrock_response_stream_shape to avoid stale cache leakage between tests. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(bedrock): pre-clamp reasoning_effort for chat invoke; correct test caps - Add _clamp_adaptive_reasoning_effort_for_bedrock to AmazonAnthropicClaudeConfig so raw reasoning_effort=xhigh degrades to the model's bedrock effort ceiling before AnthropicConfig.map_openai_params converts it to output_config. Mirrors converse path (_handle_reasoning_effort_parameter) and messages path (_clamp_adaptive_reasoning_effort_for_bedrock) so the three Bedrock paths are consistent. - grid_spec: restore caps=_CAPS_4_6 for Bedrock converse/invoke Opus 4.6 entries so the test reflects the model's actual JSON capabilities. Teach expected() to bypass the xhigh/max cap check when bedrock_effort_ceiling will clamp the wire effort, so the test still passes for Bedrock's graceful degradation contract without lying about native model caps. 
Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: Dennis Henry <dennis.henry@okta.com> Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> * feat(guardrails): wire apply_guardrail into proxy logging callbacks (BerriAI#28970) * feat(guardrails): wire apply_guardrail into proxy logging callbacks Route /apply_guardrail through pre/post proxy hooks and LiteLLM success/failure handlers so Langfuse and OTEL integrations receive input/output on guardrail-only requests. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(guardrails): fix Greptile review comments on apply_guardrail logging Co-authored-by: Cursor <cursoragent@cursor.com> * fix(apply_guardrail): preserve original exception and capture modified response - Capture return value from post_call_success_hook so callback-modified responses propagate to the caller. - Wrap success/failure logging calls in defensive try/except so logging infrastructure failures don't replace the user-visible response or mask the original guardrail exception. Co-authored-by: Yassin Kortam <yassin@berri.ai> * Fix mypy * fix(apply_guardrail): isolate failure logging and use post-hook response for logging - Split async_failure_handler and post_call_failure_hook into independent try/except blocks so a callback bug in one does not silently skip the other. - Build response_for_logging inside _emit_guardrail_success_logs after post_call_success_hook runs, so logged data matches the response the caller actually receives when the hook modifies the response. 
Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(apply_guardrail): fix black formatting and update tests for fastapi_request param - Run black on guardrail_endpoints.py to fix CI formatting check - Add _mock_proxy_logging() helper to enterprise guardrail tests to patch proxy-server globals imported at call time - Pass fastapi_request=Mock() in all direct apply_guardrail test calls to match updated function signature Co-authored-by: Cursor <cursoragent@cursor.com> * fix(guardrails): use transformed exception from post_call_failure_hook in apply_guardrail Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(guardrails): isolate sync/async logging handlers in apply_guardrail Separate each logging handler call into its own try/except so a failure in the async handler does not silently skip the sync handler submission (and vice versa). Matches the docstring's defensive intent. Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(apply_guardrail): guard transformed_exception with isinstance check Co-authored-by: Cursor <cursoragent@cursor.com> * test(guardrails): mock proxy globals in not_found test and share apply_guardrail logging fixture - Add proxy-server global mocks to test_apply_guardrail_not_found so the failure-path post_call_failure_hook call doesn't touch the real proxy logging singleton. - Extract the duplicated _mock_proxy_logging context manager out of the two enterprise apply_guardrail test files into a shared conftest fixture so the helper stays in one place. 
* fix(guardrails): use update_messages to keep logging obj in sync Co-authored-by: Yassin Kortam <yassin@berri.ai> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> Co-authored-by: mateo-berri <277851410+mateo-berri@users.noreply.github.com> * chore(ci): merge dev brach (BerriAI#29192) * build(deps): bump next from 16.2.4 to 16.2.6 in /ui/litellm-dashboard (BerriAI#27665) Bumps [next](https://github.com/vercel/next.js) from 16.2.4 to 16.2.6. - [Release notes](https://github.com/vercel/next.js/releases) - [Changelog](https://github.com/vercel/next.js/blob/canary/release.js) - [Commits](vercel/next.js@v16.2.4...v16.2.6) --- updated-dependencies: - dependency-name: next dependency-version: 16.2.6 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps): bump protobufjs in /tests/pass_through_tests (BerriAI#28296) Bumps [protobufjs](https://github.com/protobufjs/protobuf.js) from 7.5.6 to 7.6.0. - [Release notes](https://github.com/protobufjs/protobuf.js/releases) - [Changelog](https://github.com/protobufjs/protobuf.js/blob/protobufjs-v7.6.0/CHANGELOG.md) - [Commits](protobufjs/protobuf.js@protobufjs-v7.5.6...protobufjs-v7.6.0) --- updated-dependencies: - dependency-name: protobufjs dependency-version: 7.6.0 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * build(deps): bump ws from 8.20.0 to 8.20.1 in /tests/pass_through_tests (BerriAI#28303) Bumps [ws](https://github.com/websockets/ws) from 8.20.0 to 8.20.1. - [Release notes](https://github.com/websockets/ws/releases) - [Commits](websockets/ws@8.20.0...8.20.1) --- updated-dependencies: - dependency-name: ws dependency-version: 8.20.1 dependency-type: indirect ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * fix: improve bedrock streaming hot path perf (BerriAI#28720) * fix(proxy): enforce tag budgets for key-level tags (BerriAI#29108) * fix(proxy): enforce tag budgets for key-level tags Merge API key metadata.tags into request_data before _tag_max_budget_check so per-tag budgets apply when tags are set on the key at creation time. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(auth): avoid false reject for key-inherited tags Run reject_clientside_metadata_tags before key-tag injection, then inject key metadata tags immediately before tag budget checks so key tags still enforce budgets without being treated as client-supplied tags. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * fix(vertex-ai): use DB credentials in video handlers + implement Veo video edit (BerriAI#29098) * fix(vertex-ai): pass litellm_params to validate_environment in video handlers and implement video edit for Veo - Pass litellm_params to validate_environment in 11 video handler call sites (remix, create_character, get_character, edit, extension, delete) so DB-stored Vertex AI credentials are used instead of falling back to ADC - Implement transform_video_edit_request/response for VertexAI: fetches source video via fetchPredictOperation then submits a new predictLongRunning request with the video bytes/gcsUri + edit prompt Co-authored-by: Cursor <cursoragent@cursor.com> * fix(vertex-ai): hoist fetchPredictOperation into handlers to avoid blocking event loop - Add get_video_edit_prefetch_params() to BaseVideoConfig (returns None) - VertexAI overrides it to return the fetchPredictOperation URL/body - Both sync and async video_edit handlers call this and use 
their shared httpx client for the fetch, passing the result as prefetched_source_data - transform_video_edit_request is now a pure transform with no HTTP calls - Fix extra_body.pop() mutation by working on a shallow copy Co-authored-by: Cursor <cursoragent@cursor.com> * fix(vertex-ai): include prefetch call inside _handle_error try/except block Co-authored-by: Cursor <cursoragent@cursor.com> * fix(videos): add prefetched_source_data param to all transform_video_edit_request overrides Co-authored-by: Cursor <cursoragent@cursor.com> * fix(video_edit): keep transform/pre_call outside try so validation errors propagate Move transform_video_edit_request and logging_obj.pre_call outside the try/except that wraps HTTP calls in (async_)video_edit_handler so that ValueError validation errors (e.g. 'source video not complete yet') are not silently wrapped as 500s by _handle_error. The prefetch HTTP call keeps its own try/except so its errors are still mapped through the provider's error handler. Matches the pattern used by video_extension_handler and video_remix_handler. Co-authored-by: Yassin Kortam <yassin@berri.ai> * refactor(vertex_ai): delegate get_video_edit_prefetch_params to status retrieve Co-authored-by: Yassin Kortam <yassin@berri.ai> * Fix varia review * fix(video_edit): route transform errors through _handle_error Wrap transform_video_edit_request and pre_call in the same try/except as the HTTP call in sync and async handlers so validation failures (e.g. source video not complete) return typed LiteLLM exceptions. 
Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Yassin Kortam <yassin@berri.ai> * fix(datadog): drain cost-management queue + opt-in FinOps tag allowlist (BerriAI#28487) * fix(datadog): drain cost-management queue + opt-in FinOps tag allowlist * fix(datadog): guard non-dict callback_specific_params + log empty aggregation * fix(datadog): block user-controlled tags from overwriting reserved cost-attribution dimensions * fix(datadog): cast metadata to dict[str, Any] to satisfy mypy * feat(helm): split per-component ServiceAccounts for gateway, backend, and UI (BerriAI#28712) * feat(helm): split per-component ServiceAccounts for gateway, backend, and UI Replace the single shared serviceAccount with three separate serviceAccounts (gateway, backend, ui) so operators can attach different IRSA / Workload Identity annotations per component without granting data-plane credentials to the UI pod. Key changes: - values.yaml: rename serviceAccount → serviceAccounts with gateway/backend/ui sub-keys; UI defaults to automount: false - _helpers.tpl: replace litellm.serviceAccountName with three component-scoped helpers (litellm.gateway/backend/ui.serviceAccountName) - serviceaccount.yaml: create up to three separate ServiceAccount objects with component labels and per-SA automountServiceAccountToken - gateway/backend deployments: use their respective SA helpers - ui deployment: use litellm.ui.serviceAccountName + explicit automountServiceAccountToken: false on the pod spec so the projected token is absent even when the SA itself allows it - migrations-job: share the backend SA (both need DB write access) Resolves LIT-3171 https://claude.ai/code/session_01QPy362WnjmEpeNuJaPUqmF * fix(helm): enforce automountServiceAccountToken on all pod specs; fix leading --- in serviceaccount.yaml - gateway/backend deployments: add explicit automountServiceAccountToken on the pod spec so serviceAccounts.*.automount is 
honoured regardless of whether the SA is chart-created or operator-supplied (previously the flag only took effect on the SA object when create: true, creating an asymmetry with the UI which already enforced it at pod-spec level) - serviceaccount.yaml: use a $prev sentinel to emit --- only between documents, preventing a leading --- when gateway SA is skipped but backend or ui SA is created (avoids lint/GitOps warnings from strict YAML parsers and tools like ArgoCD) https://claude.ai/code/session_01QPy362WnjmEpeNuJaPUqmF --------- Co-authored-by: Claude <noreply@anthropic.com> * bump deps (BerriAI#29208) (BerriAI#29226) * fix(deps): bump vulnerable proxy dependencies (starlette/fastapi, granian, pyarrow, semantic-router) Resolve known CVEs flagged by osv-scanner/grype against uv.lock. All bumped versions verified to resolve, install, and pass the proxy auth/route/middleware unit suites (717 tests) plus an import smoke on the new stack. - starlette 0.50.0 -> 1.1.0 (CVE-2026-48710 "BadHost", GHSA-86qp-5c8j-p5mr): versions <1.0.1 reconstruct request.url from the unvalidated Host header, poisoning request.url.path. Required raising fastapi 0.124.4 -> 0.136.3, which dropped fastapi's starlette<0.51.0 cap; an explicit starlette>=1.0.1 floor blocks regression to a vulnerable transitive resolution. The proxy's own auth already reads scope["path"] via get_request_route, but the locked starlette still flagged in container scanners and left other request.url consumers exposed. - granian 2.5.7 -> 2.7.4 (CVE-2026-42544, unauthenticated DoS via WebSocket subprotocol header panic; CVE-2026-42545, WSGI response-header-panic DoS). granian is a selectable proxy server (proxy_cli). - pyarrow 22.0.0 -> 23.0.1 (CVE-2026-25087 / PYSEC-2026-113). - semantic-router 0.1.12 -> 0.1.15: 0.1.12 was yanked (CVE-2026-42208 — its unbounded litellm pin could resolve a credential-exfiltrating litellm==1.82.8 wheel). 
Not fixable by bump: diskcache 5.6.3 (CVE-2025-69872, unsafe pickle deserialization) has no upstream fix and is left pinned; exploiting it requires write access to the local cache directory. Relock side effect: sse-starlette 3.4.2 -> 3.4.4. * deps: relax exact pins in optional extras to compatible ranges The proxy/optional extras exact-pinned every dependency, which (1) forces downstream `pip install litellm[proxy]` consumers into version lockstep and (2) blocks them from pulling transitive security patches without forking — the structural cause behind needing a litellm release to clear the starlette CVE in the previous commit. Convert the ordinary extras deps to `>=current,<next_major` ranges, mirroring the core [project].dependencies style. Reproducibility for litellm's own Docker/CI is unaffected: images install via `uv sync --frozen`, and the lock re-resolves to the identical versions (no locked version changed). Kept exact-pinned: - litellm-proxy-extras, litellm-enterprise — litellm's own sub-packages, versioned in lockstep with the release. - opentelemetry-api/sdk/exporter-otlp — must resolve to matching versions. - grpcio — supply-chain-pinned to a vetted, aged release. Also corrects the stale comment claiming the extras are exact-pinned for Docker reproducibility (the images use the lock, not these pins). * fix(ci): resolve license-check lookup version from the floor for ranged deps check_licenses.py derived the PyPI lookup version with `next(iter(req.specifier))`, which returns an arbitrary specifier clause. For a range like `>=0.12.1,<1.0` it picked the upper bound (`1.0`) — a version that doesn't exist on PyPI — so the license lookup 404'd and the package was flagged as having an unknown license. The previous commit's switch from exact pins to ranges exposed this for soundfile, pyroscope-io, redisvl, diskcache, and mlflow (the ranged deps not already in liccheck.ini's allowlist). 
Prefer a lower-bound/exact version (a real released version) for the lookup. * fix(proxy): set strict_content_type=False on the FastAPI app Starlette 1.0 / FastAPI 0.13x flipped the default to strict_content_type=True, which refuses to parse a JSON request body when the client omits the Content-Type header. The proxy previously accepted those requests, so the fastapi/starlette bump in this PR would silently break clients that don't send a Content-Type. Restore the prior lenient behavior explicitly. Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com> * fix(tests/vcr): mint Google OAuth tokens live to prevent stale-token replay (BerriAI#29229) The Redis-backed VCR layer was recording and replaying the Google OAuth2/STS token-mint call. The replayed ya29.* access token is long-expired, but its recorded expires_in keeps credentials.expired False, so litellm never refreshes it and sends the stale token to a live Vertex/Gemini endpoint, which returns 401 ACCESS_TOKEN_EXPIRED. This broke live partner-model tests whose completion call is not itself cassette-backed (e.g. test_vertex_ai_llama_tool_calling). Force credential-exchange hosts to pass through live (never recorded, never replayed) by returning None from before_record_request, mirroring the existing telemetry passthrough, so a fresh token is minted each run. Regression from BerriAI#28826, which added OAuth-token matcher tolerance plus TTL-refresh-on-read so a stale token episode matched and never expired. * chore(cookbook): bump Go directive to 1.26.3 in gollem example (BerriAI#29234) Updates the gollem_go_agent_framework example to the current Go release. Clears stale Go stdlib advisories reported by osv-scanner against the older 1.25.1 directive. No source changes; the single pinned dependency (gollem v0.1.0) is backward compatible. 
* chore(ci): bump version (BerriAI#29242)

* bump: version 1.87.0 → 1.88.0

* uv lock

* feat(anthropic): add Claude Opus 4.8 and prune reasoning-effort flags (BerriAI#29238)

* feat(anthropic): add Claude Opus 4.8 and prune reasoning-effort flags

Register claude-opus-4-8 across the anthropic/bedrock/vertex/azure cost-map entries, BEDROCK_CONVERSE_MODELS, and the setup-wizard provider list.

Prune two reasoning-effort fields from the cost map:
- Drop supports_minimal_reasoning_effort from the Claude fleet (58 entries). "minimal" is not a real Anthropic effort level (the API accepts only low/medium/high/xhigh/max), so LiteLLM degrades it to "low" regardless; the flag was inert and misleading on Anthropic.
- Remove tool_use_system_prompt_tokens everywhere (103 entries). It is not in the ModelInfo type and is read by no production code.

Update the affected config/schema tests; the reasoning-effort registry tests now assert the Claude fleet omits supports_minimal.

* fix(anthropic): recognize output_config effort after minimal-flag prune

Pruning supports_minimal_reasoning_effort from the Claude fleet removed the only "supports effort param" marker from 11 Opus 4.5 / mythos-preview map entries that lack supports_output_config. _model_supports_effort_param then returned False for them, so output_config was wrongly dropped under drop_params=True -- regressing test_anthropic_model_supports_effort_param_recognizes_supporting_models for claude-opus-4-5-20251101 and the mythos preview.

- _model_supports_effort_param now treats supports_output_config as a sufficient signal, matching the bedrock-invoke call sites that already check supports_output_config OR a reasoning-effort flag. Shared map lookup extracted into _supports_model_capability.
- Add supports_output_config: true to the 11 Opus 4.5 / mythos entries that lost their only marker, restoring prior effort-forwarding behavior without re-adding the inert minimal flag.
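The "supports_output_config OR a reasoning-effort flag" rule from the fix above can be sketched roughly as follows. This is an illustrative approximation, not LiteLLM's implementation: the function names only echo the ones in the commit message, the model map is a toy dict, and `supports_example_reasoning_effort` is a made-up placeholder for whichever reasoning-effort flags the real cost map uses.

```python
from typing import Any, Mapping, Sequence

# Placeholder flag name, purely hypothetical; the real cost map defines its own.
EXAMPLE_EFFORT_FLAGS: Sequence[str] = ("supports_example_reasoning_effort",)


def supports_model_capability(
    model_map: Mapping[str, Mapping[str, Any]], model: str, key: str
) -> bool:
    """Shared lookup: True only if the model entry sets `key` to exactly True."""
    return model_map.get(model, {}).get(key) is True


def model_supports_effort_param(
    model_map: Mapping[str, Mapping[str, Any]],
    model: str,
    effort_flags: Sequence[str] = EXAMPLE_EFFORT_FLAGS,
) -> bool:
    """Treat supports_output_config as sufficient, else fall back to effort flags."""
    if supports_model_capability(model_map, model, "supports_output_config"):
        return True
    return any(
        supports_model_capability(model_map, model, flag) for flag in effort_flags
    )


# Toy entries, not taken from the real cost map.
toy_map = {
    "toy-output-config-model": {"supports_output_config": True},
    "toy-effort-flag-model": {"supports_example_reasoning_effort": True},
    "toy-plain-model": {},
}

for name in toy_map:
    print(name, model_supports_effort_param(toy_map, name))
```

With a check shaped like this, removing one flag from a map entry only changes the result when no other marker remains, which is the situation the 11 affected entries were in.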
* fix(ci): restore real Bedrock batch S3 bucket and role in oai_misc_config (BerriAI#29245)

The OSS-staging sync (d52fbfb) overwrote the Bedrock batch model's s3_bucket_name and aws_batch_role_arn with public-safe placeholders (account 123456789012 / *_EXAMPLE role). The e2e_openai_endpoints CI job runs the proxy with AWS account 941277531214 credentials, so on file upload test_bedrock_batches_api failed with:

NoSuchBucket: The specified bucket does not exist <BucketName>litellm-proxy-123456789012</BucketName>

Restore the real resources that live in account 941277531214 (verified to exist); these are the same values tests/batches_tests/test_bedrock_files_and_batches.py already references.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* fix(guardrails): persist disable_global_guardrails on keys (BerriAI#29233)

* fix(guardrails): restore disable_global_guardrails persistence for keys

The per-key/team "Disable Global Guardrails" toggle silently stopped working after BerriAI#17042, which removed `disable_global_guardrails` from the key/team request models and from the premium metadata allowlist. Without those, the UI's top-level field was dropped by pydantic and never folded into key `metadata`, so the runtime gate always read False and global default_on guardrails kept running.

Restore the request-model fields (KeyRequestBase, NewTeamRequest, UpdateTeamRequest) and the `LiteLLM_ManagementEndpoint_MetadataFields_Premium` entry so the flag is promoted into metadata again.

Because the key edit form always submits the flag (false by default), guard the UI so it is only sent when it actually changed (edit) or is enabled (create). This keeps the premium gate on enabling intact while not 403-ing non-premium users who edit unrelated key fields, mirroring how guardrails/tags are already stripped.
* test(guardrails): cover disable_global_guardrails toggle-off + clarify premium field comment

Add a prepare_metadata_fields case asserting `disable_global_guardrails: False` overwrites an existing `True`, and rewrite the PREMIUM_METADATA_FIELDS comment to explain why boolean premium fields are excluded from the empty-value strip loop.

* test(e2e): cover Team Admin view + member + key flows (BerriAI#29072)

* test(e2e): cover Team Admin view + member + key flows

Adds a new spec exercising the previously-uncovered team-admin manual-QA items: viewing all team keys (including other members'), adding a member, removing a member, and creating a team key with All Team Models. Also seeds a dedicated invitee user so the add-member test can run in parallel with the proxy-admin invite test without colliding on the team roster.

* test(e2e): harden team-admin member specs per review feedback

Address Greptile feedback on the Team Admin spec:
- locate the delete action via getByTestId("delete-member") instead of the fragile svg/img .last() selector
- match the seeded removable member by user_id (members_with_roles stores no email, so the roster renders user_id)
- assert exact success-toast strings rather than broad regexes that could match unrelated "success" text

* docs: hand-written CLAUDE.md; point GEMINI.md and AGENTS.md at it (BerriAI#29252)

* docs: replace generated CLAUDE.md with hand-written guidance, remove AGENTS.md

Swap the auto-generated CLAUDE.md for a concise hand-written version that captures how we actually want agents to work in this repo: minimal comments, simplicity first, meaningful tests with a high mutation kill rate, PRs based off litellm_internal_staging rather than main, and curl against a live proxy as proof of fix instead of pasted pytest output. Remove AGENTS.md so there is one source of truth for agent guidance.
The customer and company name confidentiality policy, along with the MCP available_on_public_internet note, are carried over from the previous CLAUDE.md.

* fix: further clarify communication guidelines

* docs: point GEMINI.md at CLAUDE.md instead of duplicating guidance

Replace the standalone GEMINI.md copy, which had already drifted from the new CLAUDE.md, with a one-line pointer so Gemini reads the same single source of truth.

* docs: simplify PR template test checklist item

Replace the rigid "at least 1 test is a hard requirement" checklist line with "I have added meaningful tests", which matches the testing guidance in CLAUDE.md, and tidy a comma into a semicolon in the scope-isolation item.

* docs: point AGENTS.md at CLAUDE.md instead of deleting it

Keep AGENTS.md so tools that read it still resolve guidance, but collapse it to the same one-line pointer to CLAUDE.md used by GEMINI.md, keeping a single source of truth.

* fix: make AI-generated rules more concise

* fix: spelling

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix: make the .env usage more careful

* docs: restore MCP available_on_public_internet note to CLAUDE.md

The PR description states this note was carried over verbatim from the previous CLAUDE.md, but it was dropped in the rewrite. Restore it so the file matches the description and the team guidance is not lost.

* docs: restore browser storage and CI supply-chain safety notes to CLAUDE.md

These security-relevant rules were dropped in the rewrite. Restore the sessionStorage-over-localStorage (XSS) guidance and the CI supply-chain rules (no curl|bash, pin versions, verify checksums) so agents editing UI or CI code are still steered away from those pitfalls.
* docs: move area-specific guidance into nested CLAUDE.md files

The MCP, browser-storage, and CI supply-chain notes are scoped to particular parts of the tree, so move each into a nested CLAUDE.md that Claude Code loads on demand when those files are touched: the MCP note under the mcp_server gateway, the browser-storage rule under the UI dashboard, and the CI supply-chain rules under .circleci. Keeps the root CLAUDE.md focused on general guidance while the area notes surface where they are relevant.

* docs: keep CI supply-chain note in root CLAUDE.md

CI guidance applies beyond .circleci (it also covers downloads in GitHub workflows and any CI script), and CI work does not reliably touch a single subtree, so a nested file under .circleci would not surface it dependably. Keep it in the always-loaded root instead. The MCP and browser-storage notes stay nested where they map cleanly to one area of the tree.

* fix: make it clear we prefer httpOnly

* chore: make ci rule more concise

* chore: make concise

Fix formatting and punctuation in MCP note.
* fix: don't include Claude attribution

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix: regenerate uv.lock to sync with pyproject.toml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Mateo Wang <277851410+mateo-berri@users.noreply.github.com>
Co-authored-by: Dennis Henry <dennis.henry@okta.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Sameer Kankute <sameer@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: michelligabriele <gabriele.michelli@icloud.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: stuxf <70670632+stuxf@users.noreply.github.com>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

github-advanced-security AI found potential problems May 23, 2026

Comment thread .github/workflows/codeql.yml

output: sarif-results/python.sarif

- name: Upload SARIF

uses: github/codeql-action/upload-sarif@ebcb5b36ded6beda4ceefea6a8bc4cc885255bb3 # v3

github-advanced-security AI found potential problems May 23, 2026

yassin-berriai changed the base branch from main to litellm_internal_staging May 25, 2026 16:35

greptile-apps Bot reviewed May 25, 2026

Comment thread litellm/litellm_core_utils/streaming_handler.py

Comment thread litellm/litellm_core_utils/streaming_handler.py

yassin-berriai force-pushed the litellm_fix/LIT-3313-streaming-chunk-overhead branch from ef9f75d to 3104e3f Compare May 28, 2026 16:41

yassin-berriai marked this pull request as ready for review May 28, 2026 16:55

yassin-berriai enabled auto-merge (squash) May 28, 2026 16:56

fix: improve bedrock streaming hot path perf

bfb7ea1

yassin-berriai force-pushed the litellm_fix/LIT-3313-streaming-chunk-overhead branch from 3104e3f to bfb7ea1 Compare May 28, 2026 16:56

yuneng-berri approved these changes May 28, 2026

yassin-berriai merged commit d5d6b26 into litellm_internal_staging May 28, 2026
104 of 118 checks passed

christianherweg0807 mentioned this pull request Jun 9, 2026

bug(streaming): fast_path in async_streaming_data_generator breaks tool-call continuation — client receives XML instead of text (introduced v1.87.0, PR #28289) #30053

Open

yassin-berriai merged 1 commit into litellm_internal_staging from litellm_fix/LIT-3313-streaming-chunk-overhead

4 participants

codspeed-hq Bot commented May 23, 2026 (edited)

Merging this PR will not alter performance

greptile-apps Bot commented May 25, 2026 (edited)

Greptile Summary

Confidence Score: 5/5
