perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths by yassin-berriai · Pull Request #28289 · BerriAI/litellm

yassin-berriai · 2026-05-19T21:49:31Z

What this PR does

Reduces per-request and per-chunk overhead on the Anthropic /v1/messages streaming path in the LiteLLM proxy. Same wire output, parity-tested.

Skip work that's a no-op in the default config. Don't run the per-chunk Datadog span when tracing is off, don't await the per-chunk streaming hook when no callback / guardrail / cost-injection is active, and skip the agentic post-processing wrapper when no callback overrides its hook (it otherwise buffers every chunk and rebuilds the response from SSE only to call hooks that all return (False, {})).
Stop doing the same work twice per request. Serialize the request body once (reuse for pre-call log + wire), memoize the optional-params type-hints resolution (~80µs/request), skip the second strip_empty_text_blocks scan when the async wrapper already sanitized.
Cheaper end-of-stream response reconstruction. Collapse the homogeneous run of content_block_delta text events into a single equivalent SSE event before stream_chunk_builder, removing O(num_output_tokens) ModelResponseStream Pydantic constructions. Tool-use / thinking / citations streams fall back to the unchanged legacy path.
Cheaper logging on the hot path. Gate debug f-string evaluation behind isEnabledFor(DEBUG) (no more serializing full message payloads at non-debug levels), hoist cost_injection_active out of the per-chunk loop, drop one async-generator layer per chunk in async_sse_data_generator.
Parity-guaranteed. Every fast path falls back to the legacy path for anything it doesn't recognize. New parity tests assert byte-identical logged/billed payloads between fast and legacy paths, plus unit tests for agentic-hook detection, pre-serialized body reuse, and memoized key resolution.
Reproducible benchmark. scripts/benchmark_anthropic_messages_perf.py boots a local mock Anthropic SSE provider + the proxy under test for apples-to-apples TTFT / TPM measurements across commits.

Benchmark

Local scripts/benchmark_anthropic_messages_perf.py against the mock Anthropic SSE provider on the same host. 500 requests/run, concurrency 20, 80-request warmup, median of 5 runs, back-to-back on the same machine. Baseline = HEAD^ (this PR reverted), Optimized = HEAD (this PR).

Metric	Baseline	Optimized	Δ
TTFT p50 (ms)	241.88	90.89	−62.4% (2.66× faster)
TTFT p95 (ms)	463.86	148.23	−68.0% (3.13× faster)
TTFT p99 (ms)	1313.46	155.77	−88.1% (8.43× faster)
Full-request p50 (ms)	242.26	91.32	−62.3%
Full-request p95 (ms)	464.24	148.88	−67.9%
Output tokens / s	4,394.5	12,504.4	+184.6% (2.85×)
Requests / s	68.66	195.38	+184.6% (2.85×)

Reproduce:

uv run python scripts/benchmark_anthropic_messages_perf.py \
    --label optimized --proxy-port 4099 --provider-port 8098 \
    --requests 500 --concurrency 20 --warmup 80 --repeats 5

Relevant issues

Linear ticket

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

I have Added testing in the tests/test_litellm/ directory, Adding at least 1 test is a hard requirement - see details
My PR passes all unit tests on make test-unit
My PR's scope is as isolated as possible, it only solves 1 specific problem
I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

CI status guideline:

50-55 passing tests: main is stable with minor issues.

45-49 passing tests: acceptable but needs attention

<= 40 passing tests: unstable; be careful with your merges and assess the risk.

Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:

Screenshots / Proof of Fix

Type

🚄 Infrastructure
🧹 Refactoring
✅ Test

Changes

CLAassistant · 2026-05-19T21:49:38Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ yassinkortam
❌ claude
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

codecov · 2026-05-19T21:52:32Z

Codecov Report

❌ Patch coverage is 84.71338% with 24 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
.../proxy/pass_through_endpoints/streaming_handler.py	56.25%	14 Missing ⚠️
..._handlers/anthropic_passthrough_logging_handler.py	93.05%	5 Missing ⚠️
litellm/llms/custom_httpx/llm_http_handler.py	85.71%	3 Missing ⚠️
litellm/proxy/common_request_processing.py	90.00%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

greptile-apps · 2026-05-19T21:57:04Z

Greptile Summary

This PR reduces per-request and per-chunk overhead on the Anthropic /v1/messages streaming path by removing several layers of redundant work in the default (no-callbacks, no-tracing) configuration.

Fast-path chunk collapsing (_collapse_pure_text_chunks): N per-token Pydantic ModelResponseStream constructions are replaced by a single merged event before stream_chunk_builder; parity tests assert byte-identical logged/billed payloads between paths, with automatic fallback for tool-use, thinking, and citations streams.
Per-request deduplication: request-body serialization reused across pre-call log and wire, get_type_hints memoized with @lru_cache, and redundant strip_empty_text_blocks scan skipped via _litellm_messages_presanitized sentinel.
Per-chunk hot-path gates: DD tracer context-manager overhead skipped via an import-time constant, per-chunk async_post_call_streaming_hook coroutine bypassed when no callbacks/guardrails are registered, cost_injection_active hoisted out of the chunk loop, and one async for: yield trampoline removed from async_sse_data_generator.

Confidence Score: 5/5

Safe to merge. All fast paths fall back to the unchanged legacy path for anything they don't recognise, and parity tests assert byte-identical logged/billed payloads between the two paths.

The changes are well-scoped and well-tested. Every new fast path has a concrete fallback, and the parity test suite directly guards the most complex optimization (chunk collapsing). The two notes left are minor: one is about a hypothetical hook mutation pattern that is unsupported by convention, the other is an import-time constant that the tests already patch correctly.

No files require special attention; the fast-path logic in anthropic_passthrough_logging_handler.py and llm_http_handler.py is the most complex but is covered by dedicated parity and retry tests.

Important Files Changed

Filename	Overview
litellm/proxy/pass_through_endpoints/llm_provider_handlers/anthropic_passthrough_logging_handler.py	Adds fast-path chunk collapsing via _collapse_pure_text_chunks that merges N text-delta events into a single event before legacy reconstruction, plus _build_complete_streaming_response_legacy refactor. Parity tests confirm byte-identical outputs.
litellm/llms/custom_httpx/llm_http_handler.py	Pre-serializes request body once (reused for pre-call log and wire), adds _has_agentic_completion_hook to skip agentic wrapper when no override present, changes signed_json_body type to Union[str, bytes].
litellm/proxy/common_request_processing.py	Adds _DD_STREAMING_TRACE_ENABLED module-level constant to skip DD span overhead per-chunk when tracing is off; adds fast-path in async_streaming_data_generator bypassing per-chunk hook; removes one generator layer from async_sse_data_generator.
litellm/proxy/pass_through_endpoints/streaming_handler.py	Hoists cost_injection_active out of the per-chunk loop; extracts _build_passthrough_logging_result for testability; refactors the hot path to avoid branching per-chunk.
litellm/proxy/utils.py	Changes async_post_call_streaming_hook detection from leaf-class dict check to MRO walk via function-identity comparison, ensuring inherited overrides are detected.
litellm/llms/anthropic/experimental_pass_through/messages/handler.py	Passes _litellm_messages_presanitized=True sentinel from async wrapper to handler to skip redundant strip_empty_text_blocks scan; handler pops the flag to prevent provider leakage.
litellm/llms/anthropic/experimental_pass_through/messages/utils.py	Memoizes get_type_hints(AnthropicMessagesRequestOptionalParams) via @lru_cache(maxsize=1), removing ~80us per-request cost.
scripts/benchmark_anthropic_messages_perf.py	New benchmark script booting a mock Anthropic SSE provider and LiteLLM proxy for apples-to-apples TTFT/TPM measurements across commits.
tests/test_litellm/proxy/pass_through_endpoints/llm_provider_handlers/test_anthropic_passthrough_logging_handler.py	Adds TestPureTextFastPathParity with 10 scenarios asserting byte-identical usage/logged payload between fast and legacy paths; adds fallback tests for tool-use and thinking blocks.
tests/test_litellm/llms/custom_httpx/test_llm_http_handler.py	Adds tests for _has_agentic_completion_hook detection and three tests for the pre-serialized body reuse/retry path.
tests/test_litellm/proxy/test_common_request_processing.py	Adds DD-trace skip test and TestAsyncStreamingDataGeneratorFastPath covering both fast and slow paths of the per-chunk hook gate.

_{Reviews (4): Last reviewed commit: "fix(mypy): narrow model_name to str in c..." | Re-trigger Greptile}

yassin-berriai · 2026-05-19T22:23:11Z

@greptileai

yassin-berriai · 2026-05-19T23:09:55Z

@greptileai

…aming hot paths - Introduce pure-text fast-path in `_build_complete_streaming_response` that collapses O(N) `content_block_delta` events into a single equivalent SSE event before conversion, eliminating per-output-token Pydantic `ModelResponseStream` construction; non-text streams (tool_use, thinking, citations) fall back to the unchanged legacy path - Skip agentic streaming wrapper entirely when no callback overrides `async_should_run_agentic_loop`; the wrapper buffered every chunk and rebuilt the SSE response only to call hooks that all return `(False, {})` — a pure no-op for the default config - Serialize request body once (`json.dumps`) for both the pre-call log input and the wire, instead of twice; avoids a full O(payload) scan per request, significant for long-context Claude Code histories - Add fast path in `async_streaming_data_generator` that bypasses the per-chunk `async_post_call_streaming_hook` coroutine await, response-string materialization, and cost-injection call when no callback/guardrail/cost-injection is active (the default config) - Resolve `_DD_STREAMING_TRACE_ENABLED` once at import time; eliminate per-chunk `NullSpan` context manager allocation when Datadog tracing is disabled (the default) - Memoize `get_type_hints(AnthropicMessagesRequestOptionalParams)` with `@lru_cache(maxsize=1)` — resolves once per process instead of once per `/v1/messages` request (~80µs each) - Hoist `cost_injection_active` out of the per-chunk loop in `chunk_processor`; eliminates repeated `getattr` + endpoint-type checks on every streamed byte chunk - Extract `_build_passthrough_logging_result` from `_route_streaming_logging_to_handler` as a standalone static method to facilitate future off-loop dispatch - Convert `async_sse_data_generator` from an `async for: yield` trampoline to a direct return of the underlying generator, removing one async-generator layer per streamed chunk - Skip redundant `strip_empty_text_blocks_from_anthropic_messages` scan in `anthropic_messages_handler` when the async wrapper already sanitized (signalled via `_litellm_messages_presanitized` sentinel, popped before reaching provider params) - Gate debug log `f-string` evaluation behind `isEnabledFor(DEBUG)` in both the streaming generator and the transformation layer to avoid serializing entire message payloads on every request at non-debug log levels - Add benchmark script (`scripts/benchmark_anthropic_messages_perf.py`) with a local mock Anthropic SSE provider for reproducible TTFT and TPM measurement across commits/branches - Add parity tests asserting fast-path and legacy-path produce byte-identical logged/billed payloads, plus unit tests for agentic hook detection, pre-serialized body reuse, and memoized key resolution

- Bail to legacy in `_collapse_pure_text_chunks` when content_block_delta events from different block indexes are observed without an intervening flush. Anthropic sends blocks strictly sequentially, but defensive bail prevents silent text-merging if the protocol ever interleaves. - Replace leaf-class `__dict__` check for `async_post_call_streaming_hook` in `_callback_capabilities` with a function-identity comparison that walks the MRO. A vendor base class can carry the override and the registered class can add nothing else; before this PR the hook was unconditionally invoked, so an inherited-override miss would silently drop the hook on the streaming path. - Add unit tests for both behaviors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The hoisted cost_injection_active flag in chunk_processor encodes the `bool(model_name)` requirement but mypy can't track that invariant through the local, so the per-chunk `_process_chunk_with_cost_injection( chunk, model_name)` calls flagged Optional[str] vs str. Pin a typed non-None local inside the cost-injection branch so mypy narrows correctly without changing runtime behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…LIT-3313) Adds two scripts under scripts/ that drive CustomStreamWrapper directly with synthetic in-memory chunks, isolating the per-chunk hot path that this PR optimises: * benchmark_model_response_creator.py — tight call loop over model_response_creator() where 3 of the 5 optimisations live (cached model name, pre-computed _base_hidden_params, single dict spread). * benchmark_streaming_chunk_overhead.py — full sync/async iteration across Anthropic GChunk, Bedrock Invoke GChunk, and Bedrock Converse ModelResponseStream streams. A full proxy benchmark like scripts/benchmark_anthropic_messages_perf.py (used in #28289) would include FastAPI + HTTP + TCP latency, which dilutes the signal from per-chunk CPU work. Both new scripts isolate the wrapper's hot path so commits can be compared apples-to-apples. Usage: uv run python scripts/benchmark_model_response_creator.py \\ --label optimized --iterations 200000 --warmup 2 --repeats 6 uv run python scripts/benchmark_streaming_chunk_overhead.py \\ --label optimized --streams 30 --chunks 500 --warmup 2 --repeats 4 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Layer 1 of the post-bump audit. E2E mock-only run found 3 fails. Root causes were independent; one is a real production regression (case 20), the other two are e2e-harness wiring. After this commit: `e2e/tools/run-all-cases --mock-only` reports 15 PASS / 0 FAIL / 7 SKIP (skips are Tier=real cases that require a real provider). ## 1. case 07 metrics-endpoint-smoke (e2e config) Upstream v1.84.0 flipped the default of `litellm.require_auth_for_metrics_endpoint` to True. Anonymous `curl /metrics` now returns 401 in the e2e harness, so the test's `families=0 ct_ok=0` was a misread auth failure rather than a Prometheus emission bug. The case 07 runbook explicitly assumes anonymous /metrics access (the scrape posture for a trusted-network local Prometheus). Restore that posture in the e2e config — production unaffected (this only ships in the autogenerated `e2e/_config/.litellm.rendered.yaml`). File: `e2e/tools/proxy` — added `require_auth_for_metrics_endpoint: false` to the rendered `litellm_settings` block. ## 2. case 20 returned_model_name streaming /v1/messages (real bug) **Production regression.** Wave 5b placed the `message_start.message.model` SSE rewrite AFTER upstream's `async_post_call_streaming_hook` call. Upstream PR BerriAI#28289 (in v1.87.0) introduced a `fast_path` short-circuit before that hook for the dominant config (no guardrails, default `include_cost_in_streaming_usage`), so the rewrite was being skipped on every streaming /v1/messages request where `returned_model_name` is set. The upstream model id leaked. Fix: move the rewrite block BEFORE the `fast_path` short-circuit. Pay near-zero overhead in the unset-override case (one dict get + one substring test in the SSE byte rewriter). File: `litellm/proxy/common_request_processing.py:2113-2173`. ## 3. case 23 mock-memory-pressure (e2e harness ordering) Case 23 reset the mock counter without waiting for the mock-side in_flight to drain. Case 20's last assertion (A4 streaming /v1/chat/completions) returns `[DONE]` to the client well before the mock-side handler finishes writing chunks (the mock counter is bumped post-`wfile.flush()`). The leaked stream from case 20 finished during case 23's burst window, bumping the counter and producing a false 6-of-5 fail. Fix: in case 23, poll `/__mock__/state` for `in_flight == 0` (50 × 100ms, bounded ~5s) before issuing the reset. File: `e2e/cases/data/23_mock_memory_pressure.sh`. ## Verification ```bash e2e/tools/run-all-cases --mock-only # → 15 PASS / 0 FAIL / 7 SKIP (skipped are Tier=real, expected) ``` Tier: C (case 20 — universal bug fix on streaming SSE) + B (case 07, 23 — internal e2e infrastructure).

…ages body Input-callbacks such as Sentry's LiteLLMIntegration store a live Span at litellm_params["metadata"]["_sentry_span"] during pre-call, after the caller's metadata was validated. When that polluted metadata reaches the bare json.dumps(request_body) that builds the Anthropic /v1/messages request, the call crashes with "Object of type ... is not JSON serializable" before the request is sent; this reproduces in the >=1.0.0,<1.82.7 range Serialize the outbound body through dumps_anthropic_messages_request_body, which drops non-serializable values from metadata rather than crashing; a non-serializable value anywhere else still surfaces as a real error so genuine request bugs are not masked. This makes the failure mode impossible by construction instead of relying on the incidental serialization reordering (BerriAI#28289) that happens to mask it on current main

greptile-apps Bot reviewed May 19, 2026

View reviewed changes

Comment thread litellm/llms/custom_httpx/llm_http_handler.py

Comment thread litellm/proxy/pass_through_endpoints/streaming_handler.py Outdated

yassin-berriai marked this pull request as draft May 19, 2026 22:10

yassin-berriai force-pushed the litellm_fix/optimize-v1-messages-streaming branch 2 times, most recently from 28cfe44 to 68f9832 Compare May 19, 2026 22:22

yassinkortam and others added 2 commits May 23, 2026 09:25

yassin-berriai force-pushed the litellm_fix/optimize-v1-messages-streaming branch from 99aa931 to 04412dc Compare May 23, 2026 16:27

yuneng-berri approved these changes May 23, 2026

View reviewed changes

yassin-berriai marked this pull request as ready for review May 23, 2026 16:44

yassin-berriai enabled auto-merge (squash) May 23, 2026 17:29

yassin-berriai merged commit 2eab9ee into litellm_internal_staging May 23, 2026
111 of 120 checks passed

yassin-berriai mentioned this pull request May 23, 2026

perf(bedrock): skip Pydantic construction for text-delta chunks in Converse + Invoke-Nova streaming #28721

Closed

5 tasks

oss-agent-shin mentioned this pull request May 28, 2026

docs(blog): Anthropic /v1/messages streaming performance improvements BerriAI/litellm-docs#245

Closed

6 tasks

christianherweg0807 mentioned this pull request Jun 9, 2026

bug(streaming): fast_path in async_streaming_data_generator breaks tool-call continuation — client receives XML instead of text (introduced v1.87.0, PR #28289) #30053

Open

jgreer013 mentioned this pull request Jun 17, 2026

fix(anthropic): drop non-serializable callback metadata from /v1/messages body #30669

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths#28289

perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths#28289
yassin-berriai merged 3 commits into
litellm_internal_stagingfrom
litellm_fix/optimize-v1-messages-streaming

yassin-berriai commented May 19, 2026 •

edited

Loading

Uh oh!

CLAassistant commented May 19, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 19, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 19, 2026 •

edited

Loading

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

yassin-berriai commented May 19, 2026

Uh oh!

yassin-berriai commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

yassin-berriai commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does

Benchmark

Relevant issues

Linear ticket

Pre-Submission checklist

Delays in PR merge?

CI (LiteLLM team)

Screenshots / Proof of Fix

Type

Changes

Uh oh!

CLAassistant commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

greptile-apps Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

yassin-berriai commented May 19, 2026

Uh oh!

yassin-berriai commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

yassin-berriai commented May 19, 2026 •

edited

Loading

CLAassistant commented May 19, 2026 •

edited

Loading

codecov Bot commented May 19, 2026 •

edited

Loading

greptile-apps Bot commented May 19, 2026 •

edited

Loading