Add chat completions streaming benchmark and fast paths #27816
yassin-berriai wants to merge 13 commits into
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Codecov Report: ❌ Patch coverage is
Greptile Summary

This PR adds streaming fast paths and a benchmark script for
Confidence Score: 5/5

Safe to merge; the fast paths are well-guarded by the cached capability check and fall back correctly to the full callback chain whenever any streaming hook or guardrail is registered.

The central optimization is gated by a stable, invalidation-aware cache and is fully exercised by the new tests. The two noted issues (fast serializer null vs omitted model, and the cost-zero cache keyed on list length) are edge cases that do not affect correctness for typical deployments.

No files require special attention; the fast-serializer discrepancy and auth cache invalidation are minor and bounded to unusual configurations.
| Filename | Overview |
|---|---|
| litellm/proxy/proxy_server.py | Adds fast serializer, direct-stream path, and early model-restamp skip; the serializer emits "model": null when model is None rather than omitting the key as model_dump_json(exclude_none=True) would. |
| litellm/proxy/utils.py | Introduces _CallbackCapabilities dataclass and _callback_capabilities() cache; capability detection logic correctly uses cls.__dict__ for override checks and properly sizes/clears the cache. |
| litellm/proxy/auth/auth_checks.py | Adds _MODEL_COST_ZERO_CACHE keyed on (id(router), len(model_list), model_name); invalidation by list length only means in-place cost changes won't flush the cache. |
| tests/test_litellm/proxy/test_proxy_logging_hook_detection.py | New test file with focused unit tests for capability detection and cache invalidation; no real network calls. |
| tests/test_litellm/proxy/test_proxy_server.py | Adds direct-stream fast-path test and mid-stream error/cleanup tests; existing tests updated to decode byte SSE chunks produced by the new path. |
| tests/test_litellm/proxy/test_response_model_sanitization.py | Existing streaming model-sanitization tests updated to force the full callback path and to decode byte chunks; coverage unchanged. |
| scripts/benchmark_chat_completions_perf.py | New benchmark script; starts a local mock provider and LiteLLM proxy, measures TTFT/RPS/overhead. Script-only, no production impact. |
Reviews (2): Last reviewed commit: "perf(proxy): cache callback capabilities..."
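The cost-zero cache caveat in the table above can be illustrated with a minimal sketch. It assumes only the key shape described in the summary, `(id(router), len(model_list), model_name)`. `FakeRouter`, the `input_cost`/`output_cost` fields, and the helper names are hypothetical stand-ins, not LiteLLM's actual code.

```python
from typing import Dict, Tuple


class FakeRouter:
    """Hypothetical router that only holds a list of model entries."""

    def __init__(self, model_list):
        self.model_list = model_list


# Cache keyed the way the summary describes: router identity, list length, model name.
_COST_ZERO_CACHE: Dict[Tuple[int, int, str], bool] = {}
lookups = 0  # counts how often the expensive computation actually runs


def _compute_is_cost_zero(router: FakeRouter, model_name: str) -> bool:
    """Stand-in for the expensive per-request lookup."""
    global lookups
    lookups += 1
    entries = [m for m in router.model_list if m["model_name"] == model_name]
    return bool(entries) and all(
        m["input_cost"] == 0 and m["output_cost"] == 0 for m in entries
    )


def is_model_cost_zero(router: FakeRouter, model_name: str) -> bool:
    key = (id(router), len(router.model_list), model_name)
    if key not in _COST_ZERO_CACHE:
        _COST_ZERO_CACHE[key] = _compute_is_cost_zero(router, model_name)
    return _COST_ZERO_CACHE[key]


router = FakeRouter(
    [{"model_name": "mock-gpt", "input_cost": 0.0, "output_cost": 0.0}]
)
first = is_model_cost_zero(router, "mock-gpt")   # computed once
second = is_model_cost_zero(router, "mock-gpt")  # served from the cache

# In-place edit: the list length is unchanged, so the key is unchanged
# and the cached answer is returned even though the cost is now non-zero.
router.model_list[0]["input_cost"] = 0.5
stale = is_model_cost_zero(router, "mock-gpt")

# Appending an entry changes len(model_list), which yields a new key
# and forces a fresh computation.
router.model_list.append(
    {"model_name": "other", "input_cost": 1.0, "output_cost": 1.0}
)
fresh = is_model_cost_zero(router, "mock-gpt")
```

In this sketch the stale answer persists until the list length changes, which matches the edge case the review flags as bounded to unusual configurations.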
…chain

Major chat-completions hot-path wins driven by profiling against a mock provider with the standard internal proxy callbacks loaded:

* Streaming iterator hook wrapped every CustomLogger callback even when the class did not override `async_post_call_streaming_iterator_hook`, adding ~11 layers of `async for chunk: yield chunk` trampolines per chunk. Only wrap callbacks that actually override the hook or have `apply_guardrail`.
* New `_CallbackCapabilities` snapshot caches per-hook detection (and the resolved CustomLogger list) keyed on a signature of `litellm.callbacks`, so the per-request walk + `get_custom_logger_compatible_class` resolution collapse to a dict lookup.
* `async_data_generator` now splits iterator-wrap vs per-chunk-hook gating (`needs_iterator_wrap` / `needs_per_chunk_streaming_hook`). The coalesced flag was paying `get_response_string` and `async_post_call_streaming_hook` per chunk on every deployment that shipped an iterator override (the default).
* Fast-return paths in `pre_call_hook`, `during_call_hook`, and `async_post_call_streaming_hook` when nothing relevant is configured.
* Cache `_is_model_cost_zero` results in the auth path — was triggering `Router.get_model_group_info` -> `ModelGroupInfo.__init__` -> `typing.get_type_hints` per request.

Benchmark improvements (`scripts/benchmark_chat_completions_perf.py`): add `--repeats` with median-of-runs selection, default to a realistic 20-chunk streaming workload, default `--measure-full-stream` on, larger default warmup. Surface streaming TTFT RPS in the summary.
Same-machine A/B (5 repeats, median run, 200 streams × 20 chunks):

| metric | before | after | Δ |
| --------------------- | ------- | ------- | -------- |
| Non-stream RPS | 360 | 426 | +18.2% |
| Streaming TTFT p50 | 193 ms | 183 ms | -5.5% |
| Stream full p50 | 236 ms | 183 ms | -22.7% |
| Stream full RPS | 54.66 | 107.75 | +97.1% |

Tests: 866 auth + proxy tests pass; 257 proxy + sanitization tests pass; 4 new regression tests for the capability scanner. Lint + mypy clean on touched files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
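The selective iterator wrapping described in the commit message above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the proxy's implementation: `BaseLogger`, `PassiveLogger`, `UppercaseLogger`, `overrides_hook`, and `wrap_stream` are hypothetical names, and the `apply_guardrail` check is omitted. The override test inspects `cls.__dict__` along the MRO, which is the detection style the review summary mentions.

```python
import asyncio
from typing import AsyncIterator, List, Tuple

HOOK = "async_post_call_streaming_iterator_hook"


class BaseLogger:
    """Hypothetical stand-in for a CustomLogger-style base class."""

    async def async_post_call_streaming_iterator_hook(
        self, response: AsyncIterator[str]
    ) -> AsyncIterator[str]:
        # Default behaviour: a pass-through layer that adds one extra
        # async-generator hop for every chunk.
        async for chunk in response:
            yield chunk


class PassiveLogger(BaseLogger):
    """Does not override the iterator hook."""


class UppercaseLogger(BaseLogger):
    """Overrides the iterator hook, so it must stay in the chain."""

    async def async_post_call_streaming_iterator_hook(self, response):
        async for chunk in response:
            yield chunk.upper()


def overrides_hook(callback: BaseLogger, base: type = BaseLogger) -> bool:
    """True if a class below `base` in the MRO defines the hook itself."""
    for cls in type(callback).__mro__:
        if cls is base:
            return False
        if HOOK in cls.__dict__:
            return True
    return False


def wrap_stream(
    stream: AsyncIterator[str], callbacks: List[BaseLogger]
) -> Tuple[AsyncIterator[str], int]:
    """Wrap the stream only with callbacks that really override the hook."""
    layers = 0
    for cb in callbacks:
        if overrides_hook(cb):
            stream = cb.async_post_call_streaming_iterator_hook(stream)
            layers += 1
    return stream, layers


async def source() -> AsyncIterator[str]:
    for chunk in ("a", "b", "c"):
        yield chunk


async def collect(stream: AsyncIterator[str]) -> List[str]:
    return [chunk async for chunk in stream]


# Ten passive callbacks plus one real override: only one layer is added.
callbacks = [PassiveLogger() for _ in range(10)] + [UppercaseLogger()]
stream, layers = wrap_stream(source(), callbacks)
result = asyncio.run(collect(stream))
```

With eleven callbacks registered, only the one overriding class contributes a generator layer, so each chunk passes through one hop instead of eleven.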
Relevant issues
Performance investigation for `/v1/chat/completions` TTFT, RPS, and streaming throughput since `v1.83.14-stable`.
Linear ticket
N/A
Pre-Submission checklist
* `tests/test_litellm/` directory, Adding at least 1 test is a hard requirement - see details
* `make test-unit`
* `@greptileai` and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?
If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).
Screenshots / Proof of Fix
Benchmark script: `scripts/benchmark_chat_completions_perf.py`

The script starts a local OpenAI-compatible mock provider, starts the LiteLLM proxy from the requested checkout, benchmarks direct provider non-streaming latency, proxy non-streaming latency/RPS, client-observed proxy overhead, proxy `x-litellm-overhead-duration-ms`, streaming TTFT, and optional full-stream completion time.

Short stream TTFT comparison vs `v1.83.14-stable`

[Table values not preserved; surviving labels: `x-litellm-overhead-duration-ms` p50, `3d2b8fed32`, `c49b75491f`]

Streaming TTFT p95 improves by ~55.7% vs `v1.83.14-stable` on this workload (290.26ms -> 128.67ms). TTFT p50 is slightly worse/noisy (113.60ms -> 121.71ms), so I am not claiming a p50 TTFT win.

Sustained streaming comparison vs `v1.83.14-stable`

[Table values not preserved; surviving labels: `3d2b8fed32`, `c49b75491f`]

Streaming full-response p95 improves by ~20.0% vs `v1.83.14-stable` on this workload (672.71ms -> 538.15ms), and full-stream RPS improves by ~7.7% (22.33 -> 24.06).

In-process hot-loop check

To isolate the proxy generator hot loop from HTTP/provider overhead, I also measured `async_data_generator` directly with no streaming callbacks configured:

Important caveat: the production change materially speeds up the no-callback per-chunk generator loop and improves p95 streaming metrics in the local proxy benchmark. It does not fix `x-litellm-overhead-duration-ms`; that header is computed in LiteLLM core as `total_response_time_ms - llm_api_duration_ms`.

Regression PR finding: I did not find a reproducible single post-`v1.83.14-stable` PR that introduced a step-function regression. I sampled hot-path PR points including #26922, #27311, #27488, and #27812; their benchmark results varied mostly with concurrent queueing and did not isolate one regression PR.

Validation:
Type
🐛 Bug Fix
✅ Test
Changes
* `scripts/benchmark_chat_completions_perf.py` to measure `/v1/chat/completions` non-streaming RPS/overhead, streaming TTFT, and optional full-stream completion time against a local OpenAI-compatible mock provider.
* `ProxyLogging.post_call_response_headers_hook` when no custom logger callbacks are configured.
* `async_data_generator`, skipping per-chunk iterator hooks, streaming hooks, response-string accumulation, and callback scans when no streaming hooks/guardrails are configured.
* `ModelResponseStream` text chunks, falling back to `model_dump_json` for richer chunks.
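The last item, a fast serializer with a fallback for richer chunks, can be sketched in a few lines. This is a minimal illustration, assuming a simplified hypothetical `Chunk` dataclass rather than LiteLLM's `ModelResponseStream`, with `full_serialize` standing in for the general `model_dump_json(exclude_none=True)` path. It also shows one way to omit `model` when it is `None` instead of emitting `"model": null`, the discrepancy noted in the review.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical, simplified streaming chunk (not ModelResponseStream)."""

    id: str
    content: Optional[str] = None
    model: Optional[str] = None
    tool_calls: Optional[list] = None


def _sse(payload: dict) -> bytes:
    """Frame a JSON payload as one server-sent event."""
    return b"data: " + json.dumps(payload).encode() + b"\n\n"


def full_serialize(chunk: Chunk) -> bytes:
    """General path: walk every field and drop None values."""
    return _sse({k: v for k, v in asdict(chunk).items() if v is not None})


def fast_serialize(chunk: Chunk) -> bytes:
    """Fast path for plain text chunks; anything richer falls back."""
    if chunk.tool_calls is not None or chunk.content is None:
        return full_serialize(chunk)
    payload = {"id": chunk.id, "content": chunk.content}
    if chunk.model is not None:
        # Omit the key entirely rather than emitting "model": null.
        payload["model"] = chunk.model
    return _sse(payload)


text_chunk = Chunk(id="c1", content="hi")
tool_chunk = Chunk(id="c2", tool_calls=[{"name": "lookup"}])

fast_text = fast_serialize(text_chunk)
fast_tool = fast_serialize(tool_chunk)
```

Keeping the fast path byte-identical to the general path for the chunks it handles is what makes the optimization safe to gate on chunk shape alone.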