perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths#28289
Conversation
|
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Greptile SummaryThis PR reduces per-request and per-chunk overhead on the Anthropic
Confidence Score: 5/5Safe to merge. All fast paths fall back to the unchanged legacy path for anything they don't recognise, and parity tests assert byte-identical logged/billed payloads between the two paths. The changes are well-scoped and well-tested. Every new fast path has a concrete fallback, and the parity test suite directly guards the most complex optimization (chunk collapsing). The two notes left are minor: one is about a hypothetical hook mutation pattern that is unsupported by convention, the other is an import-time constant that the tests already patch correctly. No files require special attention; the fast-path logic in anthropic_passthrough_logging_handler.py and llm_http_handler.py is the most complex but is covered by dedicated parity and retry tests.
|
| Filename | Overview |
|---|---|
| litellm/proxy/pass_through_endpoints/llm_provider_handlers/anthropic_passthrough_logging_handler.py | Adds fast-path chunk collapsing via _collapse_pure_text_chunks that merges N text-delta events into a single event before legacy reconstruction, plus _build_complete_streaming_response_legacy refactor. Parity tests confirm byte-identical outputs. |
| litellm/llms/custom_httpx/llm_http_handler.py | Pre-serializes request body once (reused for pre-call log and wire), adds _has_agentic_completion_hook to skip agentic wrapper when no override present, changes signed_json_body type to Union[str, bytes]. |
| litellm/proxy/common_request_processing.py | Adds _DD_STREAMING_TRACE_ENABLED module-level constant to skip DD span overhead per-chunk when tracing is off; adds fast-path in async_streaming_data_generator bypassing per-chunk hook; removes one generator layer from async_sse_data_generator. |
| litellm/proxy/pass_through_endpoints/streaming_handler.py | Hoists cost_injection_active out of the per-chunk loop; extracts _build_passthrough_logging_result for testability; refactors the hot path to avoid branching per-chunk. |
| litellm/proxy/utils.py | Changes async_post_call_streaming_hook detection from leaf-class dict check to MRO walk via function-identity comparison, ensuring inherited overrides are detected. |
| litellm/llms/anthropic/experimental_pass_through/messages/handler.py | Passes _litellm_messages_presanitized=True sentinel from async wrapper to handler to skip redundant strip_empty_text_blocks scan; handler pops the flag to prevent provider leakage. |
| litellm/llms/anthropic/experimental_pass_through/messages/utils.py | Memoizes get_type_hints(AnthropicMessagesRequestOptionalParams) via @lru_cache(maxsize=1), removing ~80us per-request cost. |
| scripts/benchmark_anthropic_messages_perf.py | New benchmark script booting a mock Anthropic SSE provider and LiteLLM proxy for apples-to-apples TTFT/TPM measurements across commits. |
| tests/test_litellm/proxy/pass_through_endpoints/llm_provider_handlers/test_anthropic_passthrough_logging_handler.py | Adds TestPureTextFastPathParity with 10 scenarios asserting byte-identical usage/logged payload between fast and legacy paths; adds fallback tests for tool-use and thinking blocks. |
| tests/test_litellm/llms/custom_httpx/test_llm_http_handler.py | Adds tests for _has_agentic_completion_hook detection and three tests for the pre-serialized body reuse/retry path. |
| tests/test_litellm/proxy/test_common_request_processing.py | Adds DD-trace skip test and TestAsyncStreamingDataGeneratorFastPath covering both fast and slow paths of the per-chunk hook gate. |
Reviews (4): Last reviewed commit: "fix(mypy): narrow model_name to str in c..." | Re-trigger Greptile
28cfe44 to
68f9832
Compare
1 similar comment
…aming hot paths
- Introduce pure-text fast-path in `_build_complete_streaming_response` that collapses O(N) `content_block_delta` events into a single equivalent SSE event before conversion, eliminating per-output-token Pydantic `ModelResponseStream` construction; non-text streams (tool_use, thinking, citations) fall back to the unchanged legacy path
- Skip agentic streaming wrapper entirely when no callback overrides `async_should_run_agentic_loop`; the wrapper buffered every chunk and rebuilt the SSE response only to call hooks that all return `(False, {})` — a pure no-op for the default config
- Serialize request body once (`json.dumps`) for both the pre-call log input and the wire, instead of twice; avoids a full O(payload) scan per request, significant for long-context Claude Code histories
- Add fast path in `async_streaming_data_generator` that bypasses the per-chunk `async_post_call_streaming_hook` coroutine await, response-string materialization, and cost-injection call when no callback/guardrail/cost-injection is active (the default config)
- Resolve `_DD_STREAMING_TRACE_ENABLED` once at import time; eliminate per-chunk `NullSpan` context manager allocation when Datadog tracing is disabled (the default)
- Memoize `get_type_hints(AnthropicMessagesRequestOptionalParams)` with `@lru_cache(maxsize=1)` — resolves once per process instead of once per `/v1/messages` request (~80µs each)
- Hoist `cost_injection_active` out of the per-chunk loop in `chunk_processor`; eliminates repeated `getattr` + endpoint-type checks on every streamed byte chunk
- Extract `_build_passthrough_logging_result` from `_route_streaming_logging_to_handler` as a standalone static method to facilitate future off-loop dispatch
- Convert `async_sse_data_generator` from an `async for: yield` trampoline to a direct return of the underlying generator, removing one async-generator layer per streamed chunk
- Skip redundant `strip_empty_text_blocks_from_anthropic_messages` scan in `anthropic_messages_handler` when the async wrapper already sanitized (signalled via `_litellm_messages_presanitized` sentinel, popped before reaching provider params)
- Gate debug log `f-string` evaluation behind `isEnabledFor(DEBUG)` in both the streaming generator and the transformation layer to avoid serializing entire message payloads on every request at non-debug log levels
- Add benchmark script (`scripts/benchmark_anthropic_messages_perf.py`) with a local mock Anthropic SSE provider for reproducible TTFT and TPM measurement across commits/branches
- Add parity tests asserting fast-path and legacy-path produce byte-identical logged/billed payloads, plus unit tests for agentic hook detection, pre-serialized body reuse, and memoized key resolution
- Bail to legacy in `_collapse_pure_text_chunks` when content_block_delta events from different block indexes are observed without an intervening flush. Anthropic sends blocks strictly sequentially, but defensive bail prevents silent text-merging if the protocol ever interleaves. - Replace leaf-class `__dict__` check for `async_post_call_streaming_hook` in `_callback_capabilities` with a function-identity comparison that walks the MRO. A vendor base class can carry the override and the registered class can add nothing else; before this PR the hook was unconditionally invoked, so an inherited-override miss would silently drop the hook on the streaming path. - Add unit tests for both behaviors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
99aa931 to
04412dc
Compare
The hoisted cost_injection_active flag in chunk_processor encodes the `bool(model_name)` requirement but mypy can't track that invariant through the local, so the per-chunk `_process_chunk_with_cost_injection( chunk, model_name)` calls flagged Optional[str] vs str. Pin a typed non-None local inside the cost-injection branch so mypy narrows correctly without changing runtime behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2eab9ee
into
litellm_internal_staging
…LIT-3313) Adds two scripts under scripts/ that drive CustomStreamWrapper directly with synthetic in-memory chunks, isolating the per-chunk hot path that this PR optimises: * benchmark_model_response_creator.py — tight call loop over model_response_creator() where 3 of the 5 optimisations live (cached model name, pre-computed _base_hidden_params, single dict spread). * benchmark_streaming_chunk_overhead.py — full sync/async iteration across Anthropic GChunk, Bedrock Invoke GChunk, and Bedrock Converse ModelResponseStream streams. A full proxy benchmark like scripts/benchmark_anthropic_messages_perf.py (used in #28289) would include FastAPI + HTTP + TCP latency, which dilutes the signal from per-chunk CPU work. Both new scripts isolate the wrapper's hot path so commits can be compared apples-to-apples. Usage: uv run python scripts/benchmark_model_response_creator.py \\ --label optimized --iterations 200000 --warmup 2 --repeats 6 uv run python scripts/benchmark_streaming_chunk_overhead.py \\ --label optimized --streams 30 --chunks 500 --warmup 2 --repeats 4 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Layer 1 of the post-bump audit. E2E mock-only run found 3 fails. Root causes were independent; one is a real production regression (case 20), the other two are e2e-harness wiring. After this commit: `e2e/tools/run-all-cases --mock-only` reports 15 PASS / 0 FAIL / 7 SKIP (skips are Tier=real cases that require a real provider). ## 1. case 07 metrics-endpoint-smoke (e2e config) Upstream v1.84.0 flipped the default of `litellm.require_auth_for_metrics_endpoint` to True. Anonymous `curl /metrics` now returns 401 in the e2e harness, so the test's `families=0 ct_ok=0` was a misread auth failure rather than a Prometheus emission bug. The case 07 runbook explicitly assumes anonymous /metrics access (the scrape posture for a trusted-network local Prometheus). Restore that posture in the e2e config — production unaffected (this only ships in the autogenerated `e2e/_config/.litellm.rendered.yaml`). File: `e2e/tools/proxy` — added `require_auth_for_metrics_endpoint: false` to the rendered `litellm_settings` block. ## 2. case 20 returned_model_name streaming /v1/messages (real bug) **Production regression.** Wave 5b placed the `message_start.message.model` SSE rewrite AFTER upstream's `async_post_call_streaming_hook` call. Upstream PR BerriAI#28289 (in v1.87.0) introduced a `fast_path` short-circuit before that hook for the dominant config (no guardrails, default `include_cost_in_streaming_usage`), so the rewrite was being skipped on every streaming /v1/messages request where `returned_model_name` is set. The upstream model id leaked. Fix: move the rewrite block BEFORE the `fast_path` short-circuit. Pay near-zero overhead in the unset-override case (one dict get + one substring test in the SSE byte rewriter). File: `litellm/proxy/common_request_processing.py:2113-2173`. ## 3. case 23 mock-memory-pressure (e2e harness ordering) Case 23 reset the mock counter without waiting for the mock-side in_flight to drain. Case 20's last assertion (A4 streaming /v1/chat/completions) returns `[DONE]` to the client well before the mock-side handler finishes writing chunks (the mock counter is bumped post-`wfile.flush()`). The leaked stream from case 20 finished during case 23's burst window, bumping the counter and producing a false 6-of-5 fail. Fix: in case 23, poll `/__mock__/state` for `in_flight == 0` (50 × 100ms, bounded ~5s) before issuing the reset. File: `e2e/cases/data/23_mock_memory_pressure.sh`. ## Verification ```bash e2e/tools/run-all-cases --mock-only # → 15 PASS / 0 FAIL / 7 SKIP (skipped are Tier=real, expected) ``` Tier: C (case 20 — universal bug fix on streaming SSE) + B (case 07, 23 — internal e2e infrastructure).
…ages body Input-callbacks such as Sentry's LiteLLMIntegration store a live Span at litellm_params["metadata"]["_sentry_span"] during pre-call, after the caller's metadata was validated. When that polluted metadata reaches the bare json.dumps(request_body) that builds the Anthropic /v1/messages request, the call crashes with "Object of type ... is not JSON serializable" before the request is sent; this reproduces in the >=1.0.0,<1.82.7 range Serialize the outbound body through dumps_anthropic_messages_request_body, which drops non-serializable values from metadata rather than crashing; a non-serializable value anywhere else still surfaces as a real error so genuine request bugs are not masked. This makes the failure mode impossible by construction instead of relying on the incidental serialization reordering (BerriAI#28289) that happens to mask it on current main
…ages body Input-callbacks such as Sentry's LiteLLMIntegration store a live Span at litellm_params["metadata"]["_sentry_span"] during pre-call, after the caller's metadata was validated. When that polluted metadata reaches the bare json.dumps(request_body) that builds the Anthropic /v1/messages request, the call crashes with "Object of type ... is not JSON serializable" before the request is sent; this reproduces in the >=1.0.0,<1.82.7 range Serialize the outbound body through dumps_anthropic_messages_request_body, which drops non-serializable values from metadata rather than crashing; a non-serializable value anywhere else still surfaces as a real error so genuine request bugs are not masked. This makes the failure mode impossible by construction instead of relying on the incidental serialization reordering (BerriAI#28289) that happens to mask it on current main
What this PR does
Reduces per-request and per-chunk overhead on the Anthropic
/v1/messagesstreaming path in the LiteLLM proxy. Same wire output, parity-tested.(False, {})).strip_empty_text_blocksscan when the async wrapper already sanitized.content_block_deltatext events into a single equivalent SSE event beforestream_chunk_builder, removing O(num_output_tokens)ModelResponseStreamPydantic constructions. Tool-use / thinking / citations streams fall back to the unchanged legacy path.f-stringevaluation behindisEnabledFor(DEBUG)(no more serializing full message payloads at non-debug levels), hoistcost_injection_activeout of the per-chunk loop, drop one async-generator layer per chunk inasync_sse_data_generator.scripts/benchmark_anthropic_messages_perf.pyboots a local mock Anthropic SSE provider + the proxy under test for apples-to-apples TTFT / TPM measurements across commits.Benchmark
Local
scripts/benchmark_anthropic_messages_perf.pyagainst the mock Anthropic SSE provider on the same host. 500 requests/run, concurrency 20, 80-request warmup, median of 5 runs, back-to-back on the same machine. Baseline =HEAD^(this PR reverted), Optimized =HEAD(this PR).Reproduce:
uv run python scripts/benchmark_anthropic_messages_perf.py \ --label optimized --proxy-port 4099 --provider-port 8098 \ --requests 500 --concurrency 20 --warmup 80 --repeats 5Relevant issues
Linear ticket
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
tests/test_litellm/directory, Adding at least 1 test is a hard requirement - see detailsmake test-unit@greptileaiand received a Confidence Score of at least 4/5 before requesting a maintainer reviewDelays in PR merge?
If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).
CI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Screenshots / Proof of Fix
Type
🚄 Infrastructure
🧹 Refactoring
✅ Test
Changes