
Add chat completions streaming benchmark and fast paths#27816

Closed
yassin-berriai wants to merge 13 commits into litellm_internal_staging from cursor/chat-completions-perf-dfa0

Conversation

@yassin-berriai

@yassin-berriai yassin-berriai commented May 13, 2026

Contributor

Relevant issues

Performance investigation for /v1/chat/completions TTFT, RPS, and streaming throughput since v1.83.14-stable.

Linear ticket

N/A

Pre-Submission checklist

  • I have added testing in the tests/test_litellm/ directory. Adding at least 1 test is a hard requirement - see details
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

  • Branch creation CI run
    Link:
  • CI run for the last commit
    Link:
  • Merge / cherry-pick CI run
    Links:

Screenshots / Proof of Fix

Benchmark script: scripts/benchmark_chat_completions_perf.py

The script starts a local OpenAI-compatible mock provider and the LiteLLM proxy from the requested checkout, then benchmarks direct-provider non-streaming latency, proxy non-streaming latency and RPS, client-observed proxy overhead, the proxy's x-litellm-overhead-duration-ms header, streaming TTFT, and (optionally) full-stream completion time.
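For readers unfamiliar with the two streaming metrics, here is a minimal sketch of how TTFT and full-stream time can be taken from a single stream. `mock_stream` and `measure_stream` are hypothetical names for illustration; the benchmark script's own implementation is not shown in this PR description and may differ.

```python
import asyncio
import time
from typing import AsyncIterator, Tuple


async def mock_stream(chunks: int, delay_s: float) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming response body."""
    for i in range(chunks):
        await asyncio.sleep(delay_s)
        yield f"chunk-{i}"


async def measure_stream(stream: AsyncIterator[str]) -> Tuple[float, float, int]:
    """Return (ttft_ms, full_ms, chunk_count) for one stream."""
    start = time.perf_counter()
    ttft_ms = None
    count = 0
    async for _ in stream:
        if ttft_ms is None:
            # Time to first token: elapsed time when the first chunk arrives.
            ttft_ms = (time.perf_counter() - start) * 1000.0
        count += 1
    # Full-stream time: elapsed time once the stream is exhausted.
    full_ms = (time.perf_counter() - start) * 1000.0
    return (ttft_ms if ttft_ms is not None else full_ms), full_ms, count


ttft_ms, full_ms, count = asyncio.run(measure_stream(mock_stream(5, 0.01)))
print(f"TTFT {ttft_ms:.1f} ms, full {full_ms:.1f} ms over {count} chunks")
```

Percentiles such as p50 and p95 would then be taken over many such measurements run concurrently.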

Short stream TTFT comparison vs v1.83.14-stable

.venv/bin/python scripts/benchmark_chat_completions_perf.py \
  --label <label> \
  --litellm-dir <checkout> \
  --proxy-command '/workspace/.venv/bin/python -c "import litellm; litellm.run_server()"' \
  --requests 1000 --concurrency 100 \
  --stream-requests 200 --stream-concurrency 20 \
  --warmup 50 --stream-warmup 10

| Run | Commit | Streaming TTFT p50 ms | Streaming TTFT p95 ms | Non-stream RPS | Client overhead p50 ms | Client overhead p95 ms | x-litellm-overhead-duration-ms p50 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| stable-short-stream-final | 3d2b8fed32 | 113.60 | 290.26 | 276.69 | 321.94 | 494.11 | 55.34 |
| current-short-stream-final | c49b75491f | 121.71 | 128.67 | 267.76 | 331.33 | 443.88 | 57.91 |

Streaming TTFT p95 improves by ~55.7% vs v1.83.14-stable on this workload (290.26ms -> 128.67ms). TTFT p50 is slightly worse/noisy (113.60ms -> 121.71ms), so I am not claiming a p50 TTFT win.
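The percentages quoted here are plain relative changes against the stable run. A minimal sketch of the arithmetic, using the TTFT values from the table above; `pct_change` is a hypothetical helper, not part of the benchmark script.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent (negative = lower)."""
    return (after - before) / before * 100.0


# Streaming TTFT values (ms) copied from the short-stream comparison.
p95_delta = pct_change(290.26, 128.67)
p50_delta = pct_change(113.60, 121.71)

print(f"TTFT p95: {p95_delta:+.1f}%")  # negative: latency dropped
print(f"TTFT p50: {p50_delta:+.1f}%")  # positive: slightly slower
```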

Sustained streaming comparison vs v1.83.14-stable

.venv/bin/python scripts/benchmark_chat_completions_perf.py \
  --label <label> \
  --litellm-dir <checkout> \
  --proxy-command '/workspace/.venv/bin/python -c "import litellm; litellm.run_server()"' \
  --requests 100 --concurrency 10 \
  --stream-requests 100 --stream-concurrency 10 \
  --warmup 10 --stream-warmup 10 \
  --provider-stream-content-chunks 100 \
  --measure-full-stream

| Run | Commit | Streaming TTFT p50 ms | Streaming full p50 ms | Streaming full p95 ms | Streaming full RPS |
| --- | --- | --- | --- | --- | --- |
| stable-streaming-100-proof | 3d2b8fed32 | 372.54 | 402.77 | 672.71 | 22.33 |
| current-streaming-100-final-head | c49b75491f | 398.49 | 391.29 | 538.15 | 24.06 |

Streaming full-response p95 improves by ~20.0% vs v1.83.14-stable on this workload (672.71ms -> 538.15ms), and full-stream RPS improves by ~7.7% (22.33 -> 24.06).

In-process hot-loop check

To isolate the proxy generator hot loop from HTTP/provider overhead, I also measured async_data_generator directly with no streaming callbacks configured:

| chunks | old hook path | direct-stream fast path | speedup |
| --- | --- | --- | --- |
| 100 | 4.36 ms | 0.63 ms | 6.96x |
| 1000 | 11.96 ms | 1.97 ms | 6.07x |
| 5000 | 55.50 ms | 8.12 ms | 6.83x |

Important caveat: the production change materially speeds up the no-callback per-chunk generator loop and improves p95 streaming metrics in the local proxy benchmark. It does not fix x-litellm-overhead-duration-ms; that header is computed in LiteLLM core as total_response_time_ms - llm_api_duration_ms.
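To make the header definition concrete, here is a minimal sketch of the subtraction described above. The function name and the sample timings are illustrative assumptions, not LiteLLM's actual code or measured values.

```python
def overhead_duration_ms(total_response_time_ms: float, llm_api_duration_ms: float) -> float:
    """Overhead as described in the PR text: total time minus provider call time."""
    return total_response_time_ms - llm_api_duration_ms


# Hypothetical timings for one request; not measured values.
overhead = overhead_duration_ms(total_response_time_ms=180.0, llm_api_duration_ms=125.0)
print(overhead)
```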

Regression PR finding: I did not find a single post-v1.83.14-stable PR that reproducibly introduced a step-function regression. I sampled hot-path PR points including #26922, #27311, #27488, and #27812; their benchmark results varied mostly with concurrent queueing, so none of them could be isolated as the regression PR.

Validation:

.venv/bin/python -m pytest \
  tests/test_litellm/proxy/test_proxy_logging_hook_detection.py \
  tests/test_litellm/proxy/test_proxy_server.py::test_async_data_generator_uses_direct_stream_fast_path_without_callbacks \
  tests/test_litellm/proxy/test_proxy_server.py::test_async_data_generator_midstream_error \
  tests/test_litellm/proxy/test_proxy_server.py::test_async_data_generator_cleanup_on_normal_completion \
  tests/test_litellm/proxy/test_response_model_sanitization.py::test_restamp_streaming_chunk_skips_matching_model \
  tests/test_litellm/proxy/test_response_model_sanitization.py::test_fast_serialize_simple_streaming_chunk_matches_model_dump_json \
  -q
# 10 passed

.venv/bin/python -m pytest \
  tests/test_litellm/proxy/test_response_model_sanitization.py::test_proxy_streaming_azure_model_router_preserves_actual_model \
  tests/test_litellm/proxy/test_response_model_sanitization.py::test_proxy_streaming_chunks_do_not_return_provider_prefixed_model \
  tests/test_litellm/proxy/test_response_model_sanitization.py::test_proxy_streaming_chunks_use_client_requested_model_before_alias_mapping \
  tests/test_litellm/proxy/test_response_model_sanitization.py::test_proxy_streaming_fastest_response_preserves_winning_model \
  -q
# 4 passed

.venv/bin/python -m ruff check litellm/proxy/proxy_server.py litellm/proxy/utils.py tests/test_litellm/proxy/test_response_model_sanitization.py
# All checks passed

.venv/bin/python -m mypy litellm/proxy/utils.py litellm/proxy/proxy_server.py --ignore-missing-imports --config-file pyproject.toml
# Success: no issues found in 2 source files

Type

🐛 Bug Fix
✅ Test

Changes

  • Added scripts/benchmark_chat_completions_perf.py to measure /v1/chat/completions non-streaming RPS/overhead, streaming TTFT, and optional full-stream completion time against a local OpenAI-compatible mock provider.
  • Added a fast early return inside ProxyLogging.post_call_response_headers_hook when no custom logger callbacks are configured.
  • Added a no-callback direct-stream path in async_data_generator, skipping per-chunk iterator hooks, streaming hooks, response-string accumulation, and callback scans when no streaming hooks/guardrails are configured.
  • Avoided restamping streaming chunk model fields when the chunk already has the requested model.
  • Added a guarded fast serializer for common simple ModelResponseStream text chunks, falling back to model_dump_json for richer chunks.
  • Added hook-detection, direct-stream fast-path, model-restamp, and stream-serialization tests.
  • Updated existing streaming model-sanitization tests to decode byte SSE chunks from the optimized path.
  • Fixed MyPy typing for callback detection and refactored streaming helpers to satisfy Ruff.
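
The direct-stream idea in the list above can be sketched as follows. This is a simplified illustration under stated assumptions: `data_generator`, `fake_stream`, and `upper_hook` are hypothetical names, chunks are plain strings, and the real `async_data_generator` handles errors, cleanup, and serialization that are omitted here.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List, Optional

ChunkHook = Callable[[str], Awaitable[str]]


async def data_generator(
    stream: AsyncIterator[str],
    per_chunk_hook: Optional[ChunkHook] = None,
) -> AsyncIterator[str]:
    """Yield SSE lines; skip hook dispatch entirely when no hook is configured."""
    if per_chunk_hook is None:
        # Fast path: no per-chunk awaits beyond the stream itself.
        async for chunk in stream:
            yield f"data: {chunk}\n\n"
    else:
        # Hook path: every chunk passes through the hook first.
        async for chunk in stream:
            chunk = await per_chunk_hook(chunk)
            yield f"data: {chunk}\n\n"
    yield "data: [DONE]\n\n"


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical provider stream with two text chunks."""
    for piece in ("hel", "lo"):
        yield piece


async def upper_hook(chunk: str) -> str:
    """Hypothetical per-chunk hook that rewrites the chunk."""
    return chunk.upper()


async def collect(hook: Optional[ChunkHook] = None) -> List[str]:
    return [line async for line in data_generator(fake_stream(), hook)]


fast = asyncio.run(collect())
slow = asyncio.run(collect(upper_hook))
```

The branch is chosen once per stream rather than once per chunk, which is the property the in-process hot-loop measurement is meant to capture.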

Co-authored-by: Yassin Kortam <yassin@berri.ai>
@CLAassistant

CLAassistant commented May 13, 2026


CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ cursoragent
❌ yassinkortam
You have signed the CLA already but the status is still pending? Let us recheck it.

@codecov

codecov Bot commented May 13, 2026


Codecov Report

❌ Patch coverage is 64.28571% with 5 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| litellm/proxy/auth/auth_checks.py | 64.28% | 5 Missing ⚠️ |


cursoragent and others added 5 commits May 13, 2026 06:21
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
@cursor cursor Bot changed the title from "Fix chat completions proxy performance regression" to "Reduce chat completions proxy hook overhead" May 13, 2026
@yassin-berriai

Contributor Author

@greptileai

@greptile-apps

greptile-apps Bot commented May 13, 2026

Contributor

Greptile Summary

This PR adds streaming fast paths and a benchmark script for /v1/chat/completions. The core change skips per-chunk callback machinery, iterator wrapping, and full Pydantic serialization when no streaming callbacks/guardrails are configured, using a cached _CallbackCapabilities snapshot to avoid per-request callback list scanning.

  • Adds _callback_capabilities() cache on ProxyLogging that inspects litellm.callbacks once and short-circuits per-chunk hook paths, during_call_hook, async_pre_call_hook, and the iterator chain for no-op callback configurations.
  • Adds _fast_serialize_simple_model_response_stream that uses orjson for common text chunks and falls back to model_dump_json for richer chunks (tool calls, logprobs, usage, system_fingerprint, provider_specific_fields).
  • Adds a module-level LRU cache for _is_model_cost_zero in auth_checks.py keyed on (id(router), len(model_list), model_name) to avoid re-constructing ModelGroupInfo per auth call.
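
The cache key described in the last bullet has a known blind spot that a small sketch makes visible. Everything here is a stand-in: `FakeRouter`, the plain dict cache, and the model name `gpt-x` are hypothetical, and the PR's actual LRU cache implementation is not shown in this summary.

```python
from typing import Dict, List, Tuple

CacheKey = Tuple[int, int, str]
_cost_zero_cache: Dict[CacheKey, bool] = {}


class FakeRouter:
    """Stand-in for a router; only holds a list of model entries."""

    def __init__(self, model_list: List[dict]) -> None:
        self.model_list = model_list


def is_model_cost_zero(router: FakeRouter, model_name: str) -> bool:
    # Key mirrors the summary: router identity, list length, model name.
    key = (id(router), len(router.model_list), model_name)
    if key not in _cost_zero_cache:
        # Stand-in for the expensive lookup: a linear scan over the entries.
        _cost_zero_cache[key] = all(
            m["cost"] == 0 for m in router.model_list if m["name"] == model_name
        )
    return _cost_zero_cache[key]


router = FakeRouter([{"name": "gpt-x", "cost": 0}])
first = is_model_cost_zero(router, "gpt-x")

# In-place cost change: list length is unchanged, so the key is unchanged.
router.model_list[0]["cost"] = 5
stale = is_model_cost_zero(router, "gpt-x")

# Adding an entry changes the length, which produces a fresh key.
router.model_list.append({"name": "other", "cost": 1})
fresh = is_model_cost_zero(router, "gpt-x")

print(first, stale, fresh)
```

In this sketch the second call still returns the cached answer after the in-place edit, and only the length change forces a recomputation.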

Confidence Score: 5/5

Safe to merge; the fast paths are well-guarded by the cached capability check and fall back correctly to the full callback chain whenever any streaming hook or guardrail is registered.

The central optimization is gated by a stable, invalidation-aware cache and is fully exercised by the new tests. The two noted issues (fast serializer null vs omitted model, and the cost-zero cache keyed on list length) are edge cases that do not affect correctness for typical deployments.

No files require special attention; the fast-serializer discrepancy and auth cache invalidation are minor and bounded to unusual configurations.
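
The null-versus-omitted discrepancy mentioned above can be illustrated with a small sketch. It uses the standard-library `json` module in place of orjson and plain dicts in place of the Pydantic model, so `full_dump` and `fast_dump` are hypothetical approximations rather than the PR's serializer.

```python
import json
from typing import Any, Dict, Optional


def full_dump(chunk: Dict[str, Any]) -> str:
    """Mimics an exclude-None dump: top-level keys whose value is None are dropped."""
    return json.dumps({k: v for k, v in chunk.items() if v is not None})


def fast_dump(chunk_id: str, model: Optional[str], content: str) -> str:
    """Hand-built payload; emits "model": null because None is not filtered."""
    payload = {
        "id": chunk_id,
        "model": model,
        "choices": [{"index": 0, "delta": {"content": content}}],
    }
    return json.dumps(payload)


chunk = {
    "id": "c1",
    "model": None,
    "choices": [{"index": 0, "delta": {"content": "hi"}}],
}
full = json.loads(full_dump(chunk))
fast = json.loads(fast_dump("c1", None, "hi"))

print("model" in full, "model" in fast)
```

A client that distinguishes a missing key from an explicit null would see different payloads from the two paths when the model is unset.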

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/proxy/proxy_server.py | Adds fast serializer, direct-stream path, and early model-restamp skip; the serializer emits "model": null when model is None rather than omitting the key as model_dump_json(exclude_none=True) would. |
| litellm/proxy/utils.py | Introduces _CallbackCapabilities dataclass and _callback_capabilities() cache; capability detection logic correctly uses `cls.__dict__` for override checks and properly sizes/clears the cache. |
| litellm/proxy/auth/auth_checks.py | Adds _MODEL_COST_ZERO_CACHE keyed on (id(router), len(model_list), model_name); invalidation by list length only means in-place cost changes won't flush the cache. |
| tests/test_litellm/proxy/test_proxy_logging_hook_detection.py | New test file with focused unit tests for capability detection and cache invalidation; no real network calls. |
| tests/test_litellm/proxy/test_proxy_server.py | Adds direct-stream fast-path test and mid-stream error/cleanup tests; existing tests updated to decode byte SSE chunks produced by the new path. |
| tests/test_litellm/proxy/test_response_model_sanitization.py | Existing streaming model-sanitization tests updated to force the full callback path and to decode byte chunks; coverage unchanged. |
| scripts/benchmark_chat_completions_perf.py | New benchmark script; starts a local mock provider and LiteLLM proxy, measures TTFT/RPS/overhead. Script-only, no production impact. |

Reviews (2): Last reviewed commit: "perf(proxy): cache callback capabilities..." | Re-trigger Greptile

Comment thread litellm/proxy/utils.py
Comment thread litellm/proxy/utils.py Outdated
cursoragent and others added 2 commits May 13, 2026 06:48
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
@cursor cursor Bot changed the title from "Reduce chat completions proxy hook overhead" to "Add chat completions benchmark and reduce response header hook overhead" May 13, 2026
Co-authored-by: Yassin Kortam <yassin@berri.ai>
@cursor cursor Bot changed the title from "Add chat completions benchmark and reduce response header hook overhead" to "Add chat completions streaming benchmark and fast paths" May 13, 2026
cursoragent and others added 4 commits May 13, 2026 07:44
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
…chain

Major chat-completions hot-path wins driven by profiling against a mock
provider with the standard internal proxy callbacks loaded:

* Streaming iterator hook wrapped every CustomLogger callback even when the
  class did not override `async_post_call_streaming_iterator_hook`, adding
  ~11 layers of `async for chunk: yield chunk` trampolines per chunk. Only
  wrap callbacks that actually override the hook or have `apply_guardrail`.
* New `_CallbackCapabilities` snapshot caches per-hook detection (and the
  resolved CustomLogger list) keyed on a signature of `litellm.callbacks`,
  so the per-request walk + `get_custom_logger_compatible_class` resolution
  collapse to a dict lookup.
* `async_data_generator` now splits iterator-wrap vs per-chunk-hook gating
  (`needs_iterator_wrap` / `needs_per_chunk_streaming_hook`). The coalesced
  flag was paying `get_response_string` and `async_post_call_streaming_hook`
  per chunk on every deployment that shipped an iterator override (the
  default).
* Fast-return paths in `pre_call_hook`, `during_call_hook`, and
  `async_post_call_streaming_hook` when nothing relevant is configured.
* Cache `_is_model_cost_zero` results in the auth path — was triggering
  `Router.get_model_group_info` -> `ModelGroupInfo.__init__` ->
  `typing.get_type_hints` per request.
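
A minimal sketch of the override check behind the first bullet, assuming a simplified base class. `BaseLogger`, `PlainLogger`, and `RewritingLogger` are hypothetical stand-ins for CustomLogger subclasses, and the `apply_guardrail` condition is left out.

```python
from typing import List


class BaseLogger:
    """Stand-in for a logger base class with a pass-through iterator hook."""

    async def async_post_call_streaming_iterator_hook(self, response):
        async for chunk in response:
            yield chunk


class PlainLogger(BaseLogger):
    """Inherits the pass-through hook, so wrapping it adds only overhead."""


class RewritingLogger(BaseLogger):
    """Overrides the hook, so it must stay in the iterator chain."""

    async def async_post_call_streaming_iterator_hook(self, response):
        async for chunk in response:
            yield chunk.upper()


def overrides_hook(callback: BaseLogger, hook_name: str) -> bool:
    """True if a class below BaseLogger in the MRO defines `hook_name` itself."""
    for cls in type(callback).__mro__:
        if cls is BaseLogger:
            return False
        if hook_name in cls.__dict__:
            return True
    return False


def callbacks_to_wrap(callbacks: List[BaseLogger]) -> List[BaseLogger]:
    hook = "async_post_call_streaming_iterator_hook"
    return [cb for cb in callbacks if overrides_hook(cb, hook)]


selected = callbacks_to_wrap([PlainLogger(), RewritingLogger(), PlainLogger()])
print([type(cb).__name__ for cb in selected])
```

Only the overriding logger is kept, so the stream passes through one generator layer instead of three in this example.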

Benchmark improvements (`scripts/benchmark_chat_completions_perf.py`):
add `--repeats` with median-of-runs selection, default to a realistic
20-chunk streaming workload, default `--measure-full-stream` on, larger
default warmup. Surface streaming TTFT RPS in the summary.
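
Median-of-runs selection could look roughly like the sketch below. Which metric the script uses to rank repeats is not stated here, so ranking by `ttft_p50_ms` is an assumption, and `pick_median_run` and the sample numbers are hypothetical.

```python
from statistics import median_low
from typing import Dict, List


def pick_median_run(runs: List[Dict[str, float]], metric: str) -> Dict[str, float]:
    """Return the run whose `metric` is the (low) median across repeats."""
    target = median_low(run[metric] for run in runs)
    return next(run for run in runs if run[metric] == target)


# Hypothetical results from three repeats; not measured values.
runs = [
    {"ttft_p50_ms": 190.0, "rps": 100.0},
    {"ttft_p50_ms": 183.0, "rps": 108.0},
    {"ttft_p50_ms": 201.0, "rps": 95.0},
]
chosen = pick_median_run(runs, "ttft_p50_ms")
print(chosen)
```

Reporting one whole run, rather than per-metric medians, keeps all numbers in a row consistent with each other.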

Same-machine A/B (5 repeats, median run, 200 streams × 20 chunks):

| metric                | before  | after   | Δ        |
| --------------------- | ------- | ------- | -------- |
| Non-stream RPS        |   360   |   426   | +18.2%   |
| Streaming TTFT p50    | 193 ms  | 183 ms  |  -5.5%   |
| Stream full p50       | 236 ms  | 183 ms  | -22.7%   |
| Stream full RPS       |  54.66  | 107.75  | +97.1%   |

Tests: 866 auth + proxy tests pass; 257 proxy + sanitization tests pass;
4 new regression tests for the capability scanner. Lint + mypy clean on
touched files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yassin-berriai

Contributor Author

@greptileai



4 participants