Skip to content

perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths#28289

Merged
yassin-berriai merged 3 commits into
litellm_internal_stagingfrom
litellm_fix/optimize-v1-messages-streaming
May 23, 2026
Merged

perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths#28289
yassin-berriai merged 3 commits into
litellm_internal_stagingfrom
litellm_fix/optimize-v1-messages-streaming

Conversation

@yassin-berriai

@yassin-berriai yassin-berriai commented May 19, 2026

Copy link
Copy Markdown
Contributor

What this PR does

Reduces per-request and per-chunk overhead on the Anthropic /v1/messages streaming path in the LiteLLM proxy. Same wire output, parity-tested.

  • Skip work that's a no-op in the default config. Don't run the per-chunk Datadog span when tracing is off, don't await the per-chunk streaming hook when no callback / guardrail / cost-injection is active, and skip the agentic post-processing wrapper when no callback overrides its hook (it otherwise buffers every chunk and rebuilds the response from SSE only to call hooks that all return (False, {})).
  • Stop doing the same work twice per request. Serialize the request body once (reuse for pre-call log + wire), memoize the optional-params type-hints resolution (~80µs/request), skip the second strip_empty_text_blocks scan when the async wrapper already sanitized.
  • Cheaper end-of-stream response reconstruction. Collapse the homogeneous run of content_block_delta text events into a single equivalent SSE event before stream_chunk_builder, removing O(num_output_tokens) ModelResponseStream Pydantic constructions. Tool-use / thinking / citations streams fall back to the unchanged legacy path.
  • Cheaper logging on the hot path. Gate debug f-string evaluation behind isEnabledFor(DEBUG) (no more serializing full message payloads at non-debug levels), hoist cost_injection_active out of the per-chunk loop, drop one async-generator layer per chunk in async_sse_data_generator.
  • Parity-guaranteed. Every fast path falls back to the legacy path for anything it doesn't recognize. New parity tests assert byte-identical logged/billed payloads between fast and legacy paths, plus unit tests for agentic-hook detection, pre-serialized body reuse, and memoized key resolution.
  • Reproducible benchmark. scripts/benchmark_anthropic_messages_perf.py boots a local mock Anthropic SSE provider + the proxy under test for apples-to-apples TTFT / TPM measurements across commits.

Benchmark

Local scripts/benchmark_anthropic_messages_perf.py against the mock Anthropic SSE provider on the same host. 500 requests/run, concurrency 20, 80-request warmup, median of 5 runs, back-to-back on the same machine. Baseline = HEAD^ (this PR reverted), Optimized = HEAD (this PR).

Metric Baseline Optimized Δ
TTFT p50 (ms) 241.88 90.89 −62.4% (2.66× faster)
TTFT p95 (ms) 463.86 148.23 −68.0% (3.13× faster)
TTFT p99 (ms) 1313.46 155.77 −88.1% (8.43× faster)
Full-request p50 (ms) 242.26 91.32 −62.3%
Full-request p95 (ms) 464.24 148.88 −67.9%
Output tokens / s 4,394.5 12,504.4 +184.6% (2.85×)
Requests / s 68.66 195.38 +184.6% (2.85×)

Reproduce:

uv run python scripts/benchmark_anthropic_messages_perf.py \
    --label optimized --proxy-port 4099 --provider-port 8098 \
    --requests 500 --concurrency 20 --warmup 80 --repeats 5

Relevant issues

Linear ticket

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have Added testing in the tests/test_litellm/ directory, Adding at least 1 test is a hard requirement - see details
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

CI status guideline:

  • 50-55 passing tests: main is stable with minor issues.
  • 45-49 passing tests: acceptable but needs attention
  • <= 40 passing tests: unstable; be careful with your merges and assess the risk.
  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Screenshots / Proof of Fix

Type

🚄 Infrastructure
🧹 Refactoring
✅ Test

Changes

@CLAassistant

CLAassistant commented May 19, 2026

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ yassinkortam
❌ claude
You have signed the CLA already but the status is still pending? Let us recheck it.

@codecov

codecov Bot commented May 19, 2026

Copy link
Copy Markdown

@greptile-apps

greptile-apps Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR reduces per-request and per-chunk overhead on the Anthropic /v1/messages streaming path by removing several layers of redundant work in the default (no-callbacks, no-tracing) configuration.

  • Fast-path chunk collapsing (_collapse_pure_text_chunks): N per-token Pydantic ModelResponseStream constructions are replaced by a single merged event before stream_chunk_builder; parity tests assert byte-identical logged/billed payloads between paths, with automatic fallback for tool-use, thinking, and citations streams.
  • Per-request deduplication: request-body serialization reused across pre-call log and wire, get_type_hints memoized with @lru_cache, and redundant strip_empty_text_blocks scan skipped via _litellm_messages_presanitized sentinel.
  • Per-chunk hot-path gates: DD tracer context-manager overhead skipped via an import-time constant, per-chunk async_post_call_streaming_hook coroutine bypassed when no callbacks/guardrails are registered, cost_injection_active hoisted out of the chunk loop, and one async for: yield trampoline removed from async_sse_data_generator.

Confidence Score: 5/5

Safe to merge. All fast paths fall back to the unchanged legacy path for anything they don't recognise, and parity tests assert byte-identical logged/billed payloads between the two paths.

The changes are well-scoped and well-tested. Every new fast path has a concrete fallback, and the parity test suite directly guards the most complex optimization (chunk collapsing). The two notes left are minor: one is about a hypothetical hook mutation pattern that is unsupported by convention, the other is an import-time constant that the tests already patch correctly.

No files require special attention; the fast-path logic in anthropic_passthrough_logging_handler.py and llm_http_handler.py is the most complex but is covered by dedicated parity and retry tests.

Important Files Changed

Filename Overview
litellm/proxy/pass_through_endpoints/llm_provider_handlers/anthropic_passthrough_logging_handler.py Adds fast-path chunk collapsing via _collapse_pure_text_chunks that merges N text-delta events into a single event before legacy reconstruction, plus _build_complete_streaming_response_legacy refactor. Parity tests confirm byte-identical outputs.
litellm/llms/custom_httpx/llm_http_handler.py Pre-serializes request body once (reused for pre-call log and wire), adds _has_agentic_completion_hook to skip agentic wrapper when no override present, changes signed_json_body type to Union[str, bytes].
litellm/proxy/common_request_processing.py Adds _DD_STREAMING_TRACE_ENABLED module-level constant to skip DD span overhead per-chunk when tracing is off; adds fast-path in async_streaming_data_generator bypassing per-chunk hook; removes one generator layer from async_sse_data_generator.
litellm/proxy/pass_through_endpoints/streaming_handler.py Hoists cost_injection_active out of the per-chunk loop; extracts _build_passthrough_logging_result for testability; refactors the hot path to avoid branching per-chunk.
litellm/proxy/utils.py Changes async_post_call_streaming_hook detection from leaf-class dict check to MRO walk via function-identity comparison, ensuring inherited overrides are detected.
litellm/llms/anthropic/experimental_pass_through/messages/handler.py Passes _litellm_messages_presanitized=True sentinel from async wrapper to handler to skip redundant strip_empty_text_blocks scan; handler pops the flag to prevent provider leakage.
litellm/llms/anthropic/experimental_pass_through/messages/utils.py Memoizes get_type_hints(AnthropicMessagesRequestOptionalParams) via @lru_cache(maxsize=1), removing ~80us per-request cost.
scripts/benchmark_anthropic_messages_perf.py New benchmark script booting a mock Anthropic SSE provider and LiteLLM proxy for apples-to-apples TTFT/TPM measurements across commits.
tests/test_litellm/proxy/pass_through_endpoints/llm_provider_handlers/test_anthropic_passthrough_logging_handler.py Adds TestPureTextFastPathParity with 10 scenarios asserting byte-identical usage/logged payload between fast and legacy paths; adds fallback tests for tool-use and thinking blocks.
tests/test_litellm/llms/custom_httpx/test_llm_http_handler.py Adds tests for _has_agentic_completion_hook detection and three tests for the pre-serialized body reuse/retry path.
tests/test_litellm/proxy/test_common_request_processing.py Adds DD-trace skip test and TestAsyncStreamingDataGeneratorFastPath covering both fast and slow paths of the per-chunk hook gate.

Reviews (4): Last reviewed commit: "fix(mypy): narrow model_name to str in c..." | Re-trigger Greptile

Comment thread litellm/llms/custom_httpx/llm_http_handler.py
Comment thread litellm/proxy/pass_through_endpoints/streaming_handler.py Outdated
@yassin-berriai yassin-berriai marked this pull request as draft May 19, 2026 22:10
@yassin-berriai yassin-berriai force-pushed the litellm_fix/optimize-v1-messages-streaming branch 2 times, most recently from 28cfe44 to 68f9832 Compare May 19, 2026 22:22
@yassin-berriai

Copy link
Copy Markdown
Contributor Author

@greptileai

1 similar comment
@yassin-berriai

Copy link
Copy Markdown
Contributor Author

@greptileai

yassinkortam and others added 2 commits May 23, 2026 09:25
…aming hot paths

- Introduce pure-text fast-path in `_build_complete_streaming_response` that collapses O(N) `content_block_delta` events into a single equivalent SSE event before conversion, eliminating per-output-token Pydantic `ModelResponseStream` construction; non-text streams (tool_use, thinking, citations) fall back to the unchanged legacy path
- Skip agentic streaming wrapper entirely when no callback overrides `async_should_run_agentic_loop`; the wrapper buffered every chunk and rebuilt the SSE response only to call hooks that all return `(False, {})` — a pure no-op for the default config
- Serialize request body once (`json.dumps`) for both the pre-call log input and the wire, instead of twice; avoids a full O(payload) scan per request, significant for long-context Claude Code histories
- Add fast path in `async_streaming_data_generator` that bypasses the per-chunk `async_post_call_streaming_hook` coroutine await, response-string materialization, and cost-injection call when no callback/guardrail/cost-injection is active (the default config)
- Resolve `_DD_STREAMING_TRACE_ENABLED` once at import time; eliminate per-chunk `NullSpan` context manager allocation when Datadog tracing is disabled (the default)
- Memoize `get_type_hints(AnthropicMessagesRequestOptionalParams)` with `@lru_cache(maxsize=1)` — resolves once per process instead of once per `/v1/messages` request (~80µs each)
- Hoist `cost_injection_active` out of the per-chunk loop in `chunk_processor`; eliminates repeated `getattr` + endpoint-type checks on every streamed byte chunk
- Extract `_build_passthrough_logging_result` from `_route_streaming_logging_to_handler` as a standalone static method to facilitate future off-loop dispatch
- Convert `async_sse_data_generator` from an `async for: yield` trampoline to a direct return of the underlying generator, removing one async-generator layer per streamed chunk
- Skip redundant `strip_empty_text_blocks_from_anthropic_messages` scan in `anthropic_messages_handler` when the async wrapper already sanitized (signalled via `_litellm_messages_presanitized` sentinel, popped before reaching provider params)
- Gate debug log `f-string` evaluation behind `isEnabledFor(DEBUG)` in both the streaming generator and the transformation layer to avoid serializing entire message payloads on every request at non-debug log levels
- Add benchmark script (`scripts/benchmark_anthropic_messages_perf.py`) with a local mock Anthropic SSE provider for reproducible TTFT and TPM measurement across commits/branches
- Add parity tests asserting fast-path and legacy-path produce byte-identical logged/billed payloads, plus unit tests for agentic hook detection, pre-serialized body reuse, and memoized key resolution
- Bail to legacy in `_collapse_pure_text_chunks` when content_block_delta
  events from different block indexes are observed without an intervening
  flush. Anthropic sends blocks strictly sequentially, but defensive bail
  prevents silent text-merging if the protocol ever interleaves.
- Replace leaf-class `__dict__` check for `async_post_call_streaming_hook`
  in `_callback_capabilities` with a function-identity comparison that
  walks the MRO. A vendor base class can carry the override and the
  registered class can add nothing else; before this PR the hook was
  unconditionally invoked, so an inherited-override miss would silently
  drop the hook on the streaming path.
- Add unit tests for both behaviors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yassin-berriai yassin-berriai force-pushed the litellm_fix/optimize-v1-messages-streaming branch from 99aa931 to 04412dc Compare May 23, 2026 16:27
@yassin-berriai yassin-berriai marked this pull request as ready for review May 23, 2026 16:44
The hoisted cost_injection_active flag in chunk_processor encodes the
`bool(model_name)` requirement but mypy can't track that invariant
through the local, so the per-chunk `_process_chunk_with_cost_injection(
chunk, model_name)` calls flagged Optional[str] vs str. Pin a typed
non-None local inside the cost-injection branch so mypy narrows
correctly without changing runtime behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yassin-berriai yassin-berriai enabled auto-merge (squash) May 23, 2026 17:29
@yassin-berriai yassin-berriai merged commit 2eab9ee into litellm_internal_staging May 23, 2026
111 of 120 checks passed
yassin-berriai pushed a commit that referenced this pull request May 25, 2026
…LIT-3313)

Adds two scripts under scripts/ that drive CustomStreamWrapper directly
with synthetic in-memory chunks, isolating the per-chunk hot path that
this PR optimises:

* benchmark_model_response_creator.py — tight call loop over
  model_response_creator() where 3 of the 5 optimisations live
  (cached model name, pre-computed _base_hidden_params, single dict
  spread).
* benchmark_streaming_chunk_overhead.py — full sync/async iteration
  across Anthropic GChunk, Bedrock Invoke GChunk, and Bedrock Converse
  ModelResponseStream streams.

A full proxy benchmark like scripts/benchmark_anthropic_messages_perf.py
(used in #28289) would include FastAPI + HTTP + TCP latency, which
dilutes the signal from per-chunk CPU work. Both new scripts isolate
the wrapper's hot path so commits can be compared apples-to-apples.

Usage:
  uv run python scripts/benchmark_model_response_creator.py \\
      --label optimized --iterations 200000 --warmup 2 --repeats 6
  uv run python scripts/benchmark_streaming_chunk_overhead.py \\
      --label optimized --streams 30 --chunks 500 --warmup 2 --repeats 4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
songkuan-zheng added a commit to GhishaDev/litellm that referenced this pull request Jun 4, 2026
Layer 1 of the post-bump audit. E2E mock-only run found 3 fails. Root
causes were independent; one is a real production regression (case 20),
the other two are e2e-harness wiring.

After this commit: `e2e/tools/run-all-cases --mock-only` reports
15 PASS / 0 FAIL / 7 SKIP (skips are Tier=real cases that require a
real provider).

## 1. case 07 metrics-endpoint-smoke (e2e config)

Upstream v1.84.0 flipped the default of
`litellm.require_auth_for_metrics_endpoint` to True. Anonymous
`curl /metrics` now returns 401 in the e2e harness, so the test's
`families=0 ct_ok=0` was a misread auth failure rather than a Prometheus
emission bug.

The case 07 runbook explicitly assumes anonymous /metrics access (the
scrape posture for a trusted-network local Prometheus). Restore that
posture in the e2e config — production unaffected (this only ships in
the autogenerated `e2e/_config/.litellm.rendered.yaml`).

File: `e2e/tools/proxy` — added `require_auth_for_metrics_endpoint: false`
to the rendered `litellm_settings` block.

## 2. case 20 returned_model_name streaming /v1/messages (real bug)

**Production regression.** Wave 5b placed the
`message_start.message.model` SSE rewrite AFTER upstream's
`async_post_call_streaming_hook` call. Upstream PR BerriAI#28289 (in v1.87.0)
introduced a `fast_path` short-circuit before that hook for the dominant
config (no guardrails, default `include_cost_in_streaming_usage`), so
the rewrite was being skipped on every streaming /v1/messages request
where `returned_model_name` is set. The upstream model id leaked.

Fix: move the rewrite block BEFORE the `fast_path` short-circuit. Pay
near-zero overhead in the unset-override case (one dict get + one
substring test in the SSE byte rewriter).

File: `litellm/proxy/common_request_processing.py:2113-2173`.

## 3. case 23 mock-memory-pressure (e2e harness ordering)

Case 23 reset the mock counter without waiting for the mock-side
in_flight to drain. Case 20's last assertion (A4 streaming
/v1/chat/completions) returns `[DONE]` to the client well before the
mock-side handler finishes writing chunks (the mock counter is bumped
post-`wfile.flush()`). The leaked stream from case 20 finished during
case 23's burst window, bumping the counter and producing a false
6-of-5 fail.

Fix: in case 23, poll `/__mock__/state` for `in_flight == 0` (50 × 100ms,
bounded ~5s) before issuing the reset.

File: `e2e/cases/data/23_mock_memory_pressure.sh`.

## Verification

```bash
e2e/tools/run-all-cases --mock-only
# → 15 PASS / 0 FAIL / 7 SKIP (skipped are Tier=real, expected)
```

Tier: C (case 20 — universal bug fix on streaming SSE) + B (case 07, 23
— internal e2e infrastructure).
jgreer013 added a commit to jgreer013/litellm that referenced this pull request Jun 17, 2026
…ages body

Input-callbacks such as Sentry's LiteLLMIntegration store a live Span at litellm_params["metadata"]["_sentry_span"] during pre-call, after the caller's metadata was validated. When that polluted metadata reaches the bare json.dumps(request_body) that builds the Anthropic /v1/messages request, the call crashes with "Object of type ... is not JSON serializable" before the request is sent; this reproduces in the >=1.0.0,<1.82.7 range

Serialize the outbound body through dumps_anthropic_messages_request_body, which drops non-serializable values from metadata rather than crashing; a non-serializable value anywhere else still surfaces as a real error so genuine request bugs are not masked. This makes the failure mode impossible by construction instead of relying on the incidental serialization reordering (BerriAI#28289) that happens to mask it on current main
jgreer013 added a commit to jgreer013/litellm that referenced this pull request Jun 18, 2026
…ages body

Input-callbacks such as Sentry's LiteLLMIntegration store a live Span at litellm_params["metadata"]["_sentry_span"] during pre-call, after the caller's metadata was validated. When that polluted metadata reaches the bare json.dumps(request_body) that builds the Anthropic /v1/messages request, the call crashes with "Object of type ... is not JSON serializable" before the request is sent; this reproduces in the >=1.0.0,<1.82.7 range

Serialize the outbound body through dumps_anthropic_messages_request_body, which drops non-serializable values from metadata rather than crashing; a non-serializable value anywhere else still surfaces as a real error so genuine request bugs are not masked. This makes the failure mode impossible by construction instead of relying on the incidental serialization reordering (BerriAI#28289) that happens to mask it on current main
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants