Add chat completions streaming benchmark and fast paths by yassin-berriai · Pull Request #27816 · BerriAI/litellm

yassin-berriai · 2026-05-13T06:15:18Z

Relevant issues

Performance investigation for /v1/chat/completions TTFT, RPS, and streaming throughput since v1.83.14-stable.

Linear ticket

N/A

Pre-Submission checklist

I have added testing in the tests/test_litellm/ directory. Adding at least 1 test is a hard requirement (see details).
My PR passes all unit tests on make test-unit
My PR's scope is as isolated as possible, it only solves 1 specific problem
I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:

Screenshots / Proof of Fix

Benchmark script: scripts/benchmark_chat_completions_perf.py

The script starts a local OpenAI-compatible mock provider, starts the LiteLLM proxy from the requested checkout, benchmarks direct provider non-streaming latency, proxy non-streaming latency/RPS, client-observed proxy overhead, proxy x-litellm-overhead-duration-ms, streaming TTFT, and optional full-stream completion time.
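
To make the two streaming metrics concrete, here is a minimal sketch of how TTFT and full-stream completion time can be taken from a single stream. It is not the code in scripts/benchmark_chat_completions_perf.py; `mock_stream` and `measure_stream` are hypothetical names, and the stream is an in-process async generator rather than an HTTP response.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def mock_stream(chunks: int, first_delay_s: float, gap_s: float) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming response body."""
    await asyncio.sleep(first_delay_s)  # simulated time before the first chunk
    for i in range(chunks):
        if i:
            await asyncio.sleep(gap_s)  # simulated gap between later chunks
        yield f"data: chunk-{i}\n\n"


async def measure_stream(stream: AsyncIterator[str]) -> Tuple[float, float, int]:
    """Return (ttft_ms, full_ms, chunk_count) for one stream."""
    start = time.perf_counter()
    ttft_ms: Optional[float] = None
    count = 0
    async for _ in stream:
        if ttft_ms is None:
            # Time to first token: elapsed time when the first chunk arrives.
            ttft_ms = (time.perf_counter() - start) * 1000.0
        count += 1
    # Full-stream time: elapsed time once the stream is exhausted.
    full_ms = (time.perf_counter() - start) * 1000.0
    if ttft_ms is None:
        raise ValueError("stream produced no chunks")
    return ttft_ms, full_ms, count


if __name__ == "__main__":
    ttft, full, n = asyncio.run(measure_stream(mock_stream(5, 0.02, 0.005)))
    print(f"ttft={ttft:.1f} ms full={full:.1f} ms chunks={n}")
```

Percentiles such as p50 and p95 would then be computed over many such samples collected at the requested concurrency.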

Short stream TTFT comparison vs `v1.83.14-stable`

.venv/bin/python scripts/benchmark_chat_completions_perf.py \
  --label <label> \
  --litellm-dir <checkout> \
  --proxy-command '/workspace/.venv/bin/python -c "import litellm; litellm.run_server()"' \
  --requests 1000 --concurrency 100 \
  --stream-requests 200 --stream-concurrency 20 \
  --warmup 50 --stream-warmup 10

| Run | Commit | Streaming TTFT p50 ms | Streaming TTFT p95 ms | Non-stream RPS | Client overhead p50 ms | Client overhead p95 ms | `x-litellm-overhead-duration-ms` p50 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| stable-short-stream-final | `3d2b8fed32` | 113.60 | 290.26 | 276.69 | 321.94 | 494.11 | 55.34 |
| current-short-stream-final | `c49b75491f` | 121.71 | 128.67 | 267.76 | 331.33 | 443.88 | 57.91 |

Streaming TTFT p95 improves by ~55.7% vs v1.83.14-stable on this workload (290.26ms -> 128.67ms). TTFT p50 is slightly worse/noisy (113.60ms -> 121.71ms), so I am not claiming a p50 TTFT win.

Sustained streaming comparison vs `v1.83.14-stable`

.venv/bin/python scripts/benchmark_chat_completions_perf.py \
  --label <label> \
  --litellm-dir <checkout> \
  --proxy-command '/workspace/.venv/bin/python -c "import litellm; litellm.run_server()"' \
  --requests 100 --concurrency 10 \
  --stream-requests 100 --stream-concurrency 10 \
  --warmup 10 --stream-warmup 10 \
  --provider-stream-content-chunks 100 \
  --measure-full-stream

| Run | Commit | Streaming TTFT p50 ms | Streaming full p50 ms | Streaming full p95 ms | Streaming full RPS |
| --- | --- | --- | --- | --- | --- |
| stable-streaming-100-proof | `3d2b8fed32` | 372.54 | 402.77 | 672.71 | 22.33 |
| current-streaming-100-final-head | `c49b75491f` | 398.49 | 391.29 | 538.15 | 24.06 |

Streaming full-response p95 improves by ~20.0% vs v1.83.14-stable on this workload (672.71ms -> 538.15ms), and full-stream RPS improves by ~7.7% (22.33 -> 24.06).
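
The percentages quoted for both comparisons follow from a plain relative-change calculation over the table values. A small sketch for reproducing them (`pct_change` is a hypothetical helper, not part of the benchmark script):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent (negative means lower)."""
    return (after - before) / before * 100.0


# Values copied from the two comparison tables above.
ttft_p95 = pct_change(290.26, 128.67)  # short-stream TTFT p95, ms
full_p95 = pct_change(672.71, 538.15)  # sustained full-stream p95, ms
full_rps = pct_change(22.33, 24.06)    # sustained full-stream RPS

if __name__ == "__main__":
    print(f"TTFT p95: {ttft_p95:+.1f}%")
    print(f"Full-stream p95: {full_p95:+.1f}%")
    print(f"Full-stream RPS: {full_rps:+.1f}%")
```

For the latency metrics a negative change is an improvement; for RPS a positive change is.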

In-process hot-loop check

To isolate the proxy generator hot loop from HTTP/provider overhead, I also measured async_data_generator directly with no streaming callbacks configured:

| chunks | old hook path | direct-stream fast path | speedup |
| --- | --- | --- | --- |
| 100 | 4.36 ms | 0.63 ms | 6.96x |
| 1000 | 11.96 ms | 1.97 ms | 6.07x |
| 5000 | 55.50 ms | 8.12 ms | 6.83x |
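
For intuition about why the per-chunk path is expensive, the sketch below times a plain async generator against the same generator wrapped in no-op pass-through layers. It is an illustration only, not the measurement behind the table: the real old path also runs hooks and response-string accumulation, and the figure of 11 layers is taken from the later commit message in this PR. All function names are hypothetical.

```python
import asyncio
import time
from typing import AsyncIterator, List


async def source(n: int) -> AsyncIterator[int]:
    """Produce n chunks with no I/O, so only generator overhead is measured."""
    for i in range(n):
        yield i


async def passthrough(inner: AsyncIterator[int]) -> AsyncIterator[int]:
    """A no-op wrapper layer: re-yields every chunk unchanged."""
    async for chunk in inner:
        yield chunk


def wrap(stream: AsyncIterator[int], layers: int) -> AsyncIterator[int]:
    """Stack the requested number of pass-through layers around the stream."""
    for _ in range(layers):
        stream = passthrough(stream)
    return stream


async def drain(stream: AsyncIterator[int]) -> List[int]:
    return [chunk async for chunk in stream]


async def time_ms(n: int, layers: int) -> float:
    start = time.perf_counter()
    await drain(wrap(source(n), layers))
    return (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    for n in (100, 1000, 5000):
        wrapped = asyncio.run(time_ms(n, layers=11))
        direct = asyncio.run(time_ms(n, layers=0))
        print(f"{n} chunks: wrapped={wrapped:.2f} ms direct={direct:.2f} ms")
```

Absolute timings depend on the machine and Python version, so none are quoted here.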

Important caveat: the production change materially speeds up the no-callback per-chunk generator loop and improves p95 streaming metrics in the local proxy benchmark. It does not fix x-litellm-overhead-duration-ms; that header is computed in LiteLLM core as total_response_time_ms - llm_api_duration_ms.
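
The header arithmetic described in the caveat can be written out as follows. This is a sketch of the stated formula only; the function name is hypothetical and the example numbers are invented for illustration.

```python
def overhead_duration_ms(total_response_time_ms: float, llm_api_duration_ms: float) -> float:
    """Overhead as described above: total response time minus time spent in the LLM API call."""
    return total_response_time_ms - llm_api_duration_ms


if __name__ == "__main__":
    # Hypothetical timings, not taken from the benchmark runs.
    print(overhead_duration_ms(180.0, 125.0))
```

Because both inputs are measured in LiteLLM core, changes confined to the proxy's chunk generator would not be expected to move this value.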

Regression PR finding: I did not find a reproducible single post-v1.83.14-stable PR that introduced a step-function regression. I sampled hot-path PR points including #26922, #27311, #27488, and #27812; their benchmark results varied mostly with concurrent queueing and did not isolate one regression PR.

Validation:

.venv/bin/python -m pytest \
  tests/test_litellm/proxy/test_proxy_logging_hook_detection.py \
  tests/test_litellm/proxy/test_proxy_server.py::test_async_data_generator_uses_direct_stream_fast_path_without_callbacks \
  tests/test_litellm/proxy/test_proxy_server.py::test_async_data_generator_midstream_error \
  tests/test_litellm/proxy/test_proxy_server.py::test_async_data_generator_cleanup_on_normal_completion \
  tests/test_litellm/proxy/test_response_model_sanitization.py::test_restamp_streaming_chunk_skips_matching_model \
  tests/test_litellm/proxy/test_response_model_sanitization.py::test_fast_serialize_simple_streaming_chunk_matches_model_dump_json \
  -q
# 10 passed

.venv/bin/python -m pytest \
  tests/test_litellm/proxy/test_response_model_sanitization.py::test_proxy_streaming_azure_model_router_preserves_actual_model \
  tests/test_litellm/proxy/test_response_model_sanitization.py::test_proxy_streaming_chunks_do_not_return_provider_prefixed_model \
  tests/test_litellm/proxy/test_response_model_sanitization.py::test_proxy_streaming_chunks_use_client_requested_model_before_alias_mapping \
  tests/test_litellm/proxy/test_response_model_sanitization.py::test_proxy_streaming_fastest_response_preserves_winning_model \
  -q
# 4 passed

.venv/bin/python -m ruff check litellm/proxy/proxy_server.py litellm/proxy/utils.py tests/test_litellm/proxy/test_response_model_sanitization.py
# All checks passed

.venv/bin/python -m mypy litellm/proxy/utils.py litellm/proxy/proxy_server.py --ignore-missing-imports --config-file pyproject.toml
# Success: no issues found in 2 source files

Type

🐛 Bug Fix
✅ Test

Changes

Added scripts/benchmark_chat_completions_perf.py to measure /v1/chat/completions non-streaming RPS/overhead, streaming TTFT, and optional full-stream completion time against a local OpenAI-compatible mock provider.
Added a fast early return inside ProxyLogging.post_call_response_headers_hook when no custom logger callbacks are configured.
Added a no-callback direct-stream path in async_data_generator, skipping per-chunk iterator hooks, streaming hooks, response-string accumulation, and callback scans when no streaming hooks/guardrails are configured.
Skipped restamping streaming chunk model fields when the chunk already has the requested model.
Added a guarded fast serializer for common simple ModelResponseStream text chunks, falling back to model_dump_json for richer chunks.
Added hook-detection, direct-stream fast-path, model-restamp, and stream-serialization tests.
Updated existing streaming model-sanitization tests to decode byte SSE chunks from the optimized path.
Fixed MyPy typing for callback detection and refactored streaming helpers to satisfy Ruff.
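
The guarded-serializer idea can be sketched as below. This is not the PR's implementation: `Chunk` is a hypothetical stand-in for ModelResponseStream, stdlib `json` is used so the example stays self-contained, and the general path is a hand-written substitute for model_dump_json. The point is the guard: only chunks known to be simple take the fixed-shape path, and both paths must agree on the bytes for those chunks.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a streaming chunk (not ModelResponseStream)."""
    id: str
    created: int
    model: str
    content: Optional[str] = None
    tool_calls: Optional[List[Dict[str, Any]]] = None
    usage: Optional[Dict[str, int]] = None


def _dumps(payload: Dict[str, Any]) -> bytes:
    return json.dumps(payload, separators=(",", ":")).encode("utf-8")


def full_serialize(chunk: Chunk) -> bytes:
    """General path: build every field, then drop the None-valued ones."""
    delta = {"content": chunk.content, "tool_calls": chunk.tool_calls}
    payload = {
        "id": chunk.id,
        "created": chunk.created,
        "model": chunk.model,
        "object": "chat.completion.chunk",
        "choices": [{"index": 0, "delta": {k: v for k, v in delta.items() if v is not None}}],
        "usage": chunk.usage,
    }
    return _dumps({k: v for k, v in payload.items() if v is not None})


def fast_serialize(chunk: Chunk) -> bytes:
    """Guarded fast path: only plain text chunks skip the general path."""
    simple = chunk.content is not None and chunk.tool_calls is None and chunk.usage is None
    if not simple:
        return full_serialize(chunk)  # richer chunk: fall back
    return _dumps({
        "id": chunk.id,
        "created": chunk.created,
        "model": chunk.model,
        "object": "chat.completion.chunk",
        "choices": [{"index": 0, "delta": {"content": chunk.content}}],
    })


def to_sse(body: bytes) -> bytes:
    """Frame a serialized chunk as a byte SSE event."""
    return b"data: " + body + b"\n\n"
```

A parity check in the spirit of test_fast_serialize_simple_streaming_chunk_matches_model_dump_json would assert `fast_serialize(c) == full_serialize(c)` for a text-only chunk.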

Co-authored-by: Yassin Kortam <yassin@berri.ai>

CLAassistant · 2026-05-13T06:15:27Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ cursoragent
❌ yassinkortam
You have signed the CLA already but the status is still pending? Let us recheck it.

codecov · 2026-05-13T06:18:25Z

Codecov Report

❌ Patch coverage is 64.28571% with 5 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| litellm/proxy/auth/auth_checks.py | 64.28% | 5 Missing ⚠️ |


Co-authored-by: Yassin Kortam <yassin@berri.ai>

yassin-berriai · 2026-05-13T06:39:35Z

@greptileai

greptile-apps · 2026-05-13T06:42:46Z

Greptile Summary

This PR adds streaming fast paths and a benchmark script for /v1/chat/completions. The core change skips per-chunk callback machinery, iterator wrapping, and full Pydantic serialization when no streaming callbacks/guardrails are configured, using a cached _CallbackCapabilities snapshot to avoid per-request callback list scanning.

Adds _callback_capabilities() cache on ProxyLogging that inspects litellm.callbacks once and short-circuits per-chunk hook paths, during_call_hook, async_pre_call_hook, and the iterator chain for no-op callback configurations.
Adds _fast_serialize_simple_model_response_stream that uses orjson for common text chunks and falls back to model_dump_json for richer chunks (tool calls, logprobs, usage, system_fingerprint, provider_specific_fields).
Adds a module-level LRU cache for _is_model_cost_zero in auth_checks.py keyed on (id(router), len(model_list), model_name) to avoid re-constructing ModelGroupInfo per auth call.
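
A rough sketch of the capability-snapshot idea from the first bullet follows. It is an assumption-laden illustration, not the code in litellm/proxy/utils.py: `BaseLogger`, `Capabilities`, `overrides`, and `capabilities` are hypothetical names, and the cache here is keyed on the tuple of callback classes, which may differ from the signature the PR uses.

```python
from dataclasses import dataclass
from typing import Any, Dict, Sequence, Tuple


class BaseLogger:
    """Hypothetical base class whose default hooks are no-ops."""

    async def async_post_call_streaming_hook(self, chunk: Any) -> Any:
        return chunk

    async def async_post_call_streaming_iterator_hook(self, stream: Any) -> Any:
        async for chunk in stream:
            yield chunk


@dataclass(frozen=True)
class Capabilities:
    needs_iterator_wrap: bool
    needs_per_chunk_streaming_hook: bool


def overrides(callback: Any, base: type, name: str) -> bool:
    """True if a class below base in the callback's MRO defines name itself."""
    for cls in type(callback).__mro__:
        if cls is base:
            return False
        if name in cls.__dict__:
            return True
    return False


_CACHE: Dict[Tuple[type, ...], Capabilities] = {}


def capabilities(callbacks: Sequence[Any]) -> Capabilities:
    """Scan the callbacks once per distinct configuration, then reuse the snapshot."""
    key = tuple(type(cb) for cb in callbacks)
    cached = _CACHE.get(key)
    if cached is None:
        cached = Capabilities(
            needs_iterator_wrap=any(
                overrides(cb, BaseLogger, "async_post_call_streaming_iterator_hook")
                or callable(getattr(cb, "apply_guardrail", None))
                for cb in callbacks
            ),
            needs_per_chunk_streaming_hook=any(
                overrides(cb, BaseLogger, "async_post_call_streaming_hook")
                for cb in callbacks
            ),
        )
        _CACHE.clear()  # keep only the snapshot for the latest configuration
        _CACHE[key] = cached
    return cached


class Noop(BaseLogger):
    """Inherits every hook unchanged, so it needs no streaming work."""


class PerChunk(BaseLogger):
    async def async_post_call_streaming_hook(self, chunk: Any) -> Any:
        return chunk
```

With only `Noop` registered, both flags are false and the generator can take the direct-stream path; adding `PerChunk` flips the per-chunk flag without forcing an iterator wrap.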

Confidence Score: 5/5

Safe to merge; the fast paths are well-guarded by the cached capability check and fall back correctly to the full callback chain whenever any streaming hook or guardrail is registered.

The central optimization is gated by a stable, invalidation-aware cache and is fully exercised by the new tests. The two noted issues (fast serializer null vs omitted model, and the cost-zero cache keyed on list length) are edge cases that do not affect correctness for typical deployments.

No files require special attention; the fast-serializer discrepancy and auth cache invalidation are minor and bounded to unusual configurations.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/proxy/proxy_server.py | Adds fast serializer, direct-stream path, and early model-restamp skip; the serializer emits `"model": null` when model is None rather than omitting the key as `model_dump_json(exclude_none=True)` would. |
| litellm/proxy/utils.py | Introduces `_CallbackCapabilities` dataclass and `_callback_capabilities()` cache; capability detection logic correctly uses `cls.__dict__` for override checks and properly sizes/clears the cache. |
| litellm/proxy/auth/auth_checks.py | Adds `_MODEL_COST_ZERO_CACHE` keyed on `(id(router), len(model_list), model_name)`; invalidation by list length only means in-place cost changes won't flush the cache. |
| tests/test_litellm/proxy/test_proxy_logging_hook_detection.py | New test file with focused unit tests for capability detection and cache invalidation; no real network calls. |
| tests/test_litellm/proxy/test_proxy_server.py | Adds direct-stream fast-path test and mid-stream error/cleanup tests; existing tests updated to decode byte SSE chunks produced by the new path. |
| tests/test_litellm/proxy/test_response_model_sanitization.py | Existing streaming model-sanitization tests updated to force the full callback path and to decode byte chunks; coverage unchanged. |
| scripts/benchmark_chat_completions_perf.py | New benchmark script; starts a local mock provider and LiteLLM proxy, measures TTFT/RPS/overhead. Script-only, no production impact. |
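
The cache-invalidation caveat noted for auth_checks.py can be reproduced with a toy model. The sketch below is hypothetical throughout (`Router`, the `input_cost`/`output_cost` fields, and the helper names are stand-ins, and a plain dict replaces the real cache); it only demonstrates what a key of `(id(router), len(model_list), model_name)` does and does not detect.

```python
from typing import Any, Dict, List, Tuple


class Router:
    """Hypothetical stand-in holding a mutable model list."""

    def __init__(self, model_list: List[Dict[str, Any]]) -> None:
        self.model_list = model_list


_CACHE: Dict[Tuple[int, int, str], bool] = {}


def compute_is_cost_zero(router: Router, model_name: str) -> bool:
    """Uncached check: a model is free if both of its costs are zero."""
    for entry in router.model_list:
        if entry["model_name"] == model_name:
            return entry["input_cost"] == 0 and entry["output_cost"] == 0
    return False


def is_model_cost_zero(router: Router, model_name: str) -> bool:
    """Cached check keyed on router identity, list length, and model name."""
    key = (id(router), len(router.model_list), model_name)
    if key not in _CACHE:
        _CACHE[key] = compute_is_cost_zero(router, model_name)
    return _CACHE[key]


router = Router([{"model_name": "free-model", "input_cost": 0, "output_cost": 0}])
first = is_model_cost_zero(router, "free-model")

# In-place cost change: list length is unchanged, so the cached value is reused.
router.model_list[0]["input_cost"] = 0.01
stale = is_model_cost_zero(router, "free-model")

# Appending an entry changes the length, which produces a new key and a recompute.
router.model_list.append({"model_name": "other", "input_cost": 1, "output_cost": 1})
fresh = is_model_cost_zero(router, "free-model")
```

Here `stale` still reports the old answer after the in-place edit, while `fresh` reflects the new cost once the list length changes.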

Reviews (2): Last reviewed commit: "perf(proxy): cache callback capabilities..."

Co-authored-by: Yassin Kortam <yassin@berri.ai>

…chain

Major chat-completions hot-path wins driven by profiling against a mock provider with the standard internal proxy callbacks loaded:

* Streaming iterator hook wrapped every CustomLogger callback even when the class did not override `async_post_call_streaming_iterator_hook`, adding ~11 layers of `async for chunk: yield chunk` trampolines per chunk. Only wrap callbacks that actually override the hook or have `apply_guardrail`.
* New `_CallbackCapabilities` snapshot caches per-hook detection (and the resolved CustomLogger list) keyed on a signature of `litellm.callbacks`, so the per-request walk + `get_custom_logger_compatible_class` resolution collapse to a dict lookup.
* `async_data_generator` now splits iterator-wrap vs per-chunk-hook gating (`needs_iterator_wrap` / `needs_per_chunk_streaming_hook`). The coalesced flag was paying `get_response_string` and `async_post_call_streaming_hook` per chunk on every deployment that shipped an iterator override (the default).
* Fast-return paths in `pre_call_hook`, `during_call_hook`, and `async_post_call_streaming_hook` when nothing relevant is configured.
* Cache `_is_model_cost_zero` results in the auth path, which was triggering `Router.get_model_group_info` -> `ModelGroupInfo.__init__` -> `typing.get_type_hints` per request.

Benchmark improvements (`scripts/benchmark_chat_completions_perf.py`): add `--repeats` with median-of-runs selection, default to a realistic 20-chunk streaming workload, default `--measure-full-stream` on, larger default warmup. Surface streaming TTFT RPS in the summary.

Same-machine A/B (5 repeats, median run, 200 streams × 20 chunks):

| metric | before | after | Δ |
| --------------------- | ------- | ------- | -------- |
| Non-stream RPS | 360 | 426 | +18.2% |
| Streaming TTFT p50 | 193 ms | 183 ms | -5.5% |
| Stream full p50 | 236 ms | 183 ms | -22.7% |
| Stream full RPS | 54.66 | 107.75 | +97.1% |

Tests: 866 auth + proxy tests pass; 257 proxy + sanitization tests pass; 4 new regression tests for the capability scanner. Lint + mypy clean on touched files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
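
One plausible reading of "median-of-runs selection" is sketched below: repeat the benchmark, then report the single run whose headline metric is the median, so every reported number comes from the same run. How the script actually selects the run is not shown in this thread; `median_run` and the sample values are hypothetical.

```python
from typing import Dict, List


def median_run(runs: List[Dict[str, float]], metric: str) -> Dict[str, float]:
    """Return the run whose metric is the median (upper middle for even counts)."""
    if not runs:
        raise ValueError("at least one run is required")
    ordered = sorted(runs, key=lambda run: run[metric])
    return ordered[len(ordered) // 2]


if __name__ == "__main__":
    # Hypothetical results from five repeats.
    runs = [
        {"stream_full_rps": 50.0, "ttft_p50_ms": 190.0},
        {"stream_full_rps": 54.0, "ttft_p50_ms": 186.0},
        {"stream_full_rps": 60.0, "ttft_p50_ms": 181.0},
        {"stream_full_rps": 52.0, "ttft_p50_ms": 188.0},
        {"stream_full_rps": 58.0, "ttft_p50_ms": 183.0},
    ]
    print(median_run(runs, "stream_full_rps"))
```

Selecting a whole run, rather than taking the median of each metric independently, keeps the reported metrics internally consistent.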

yassin-berriai · 2026-05-13T16:39:56Z

@greptileai

Add chat completions performance benchmark

bc69118

Co-authored-by: Yassin Kortam <yassin@berri.ai>

cursoragent and others added 5 commits May 13, 2026 06:21

Skip noop during-call hook on proxy requests

9c33919

Co-authored-by: Yassin Kortam <yassin@berri.ai>

Skip noop response header hooks

ea8dfee

Co-authored-by: Yassin Kortam <yassin@berri.ai>

Move hook detection tests under test_litellm

9ff0c33

Co-authored-by: Yassin Kortam <yassin@berri.ai>

Cover proxy hook guard branches

3ae9155

Co-authored-by: Yassin Kortam <yassin@berri.ai>

Fix proxy callback detector typing

7f15157

Co-authored-by: Yassin Kortam <yassin@berri.ai>

cursor Bot changed the title ~~Fix chat completions proxy performance regression~~ Reduce chat completions proxy hook overhead May 13, 2026

greptile-apps Bot reviewed May 13, 2026


Comment thread litellm/proxy/utils.py

Comment thread litellm/proxy/utils.py Outdated

cursoragent and others added 2 commits May 13, 2026 06:48

Narrow proxy hook coverage changes

1aa70c3

Co-authored-by: Yassin Kortam <yassin@berri.ai>

Keep hook optimization within proxy logging

48a6cf2

Co-authored-by: Yassin Kortam <yassin@berri.ai>

cursor Bot changed the title ~~Reduce chat completions proxy hook overhead~~ Add chat completions benchmark and reduce response header hook overhead May 13, 2026

Add streaming fast path for no callback streams

e9f18aa

Co-authored-by: Yassin Kortam <yassin@berri.ai>

cursor Bot changed the title ~~Add chat completions benchmark and reduce response header hook overhead~~ Add chat completions streaming benchmark and fast paths May 13, 2026

cursoragent and others added 4 commits May 13, 2026 07:44

Optimize no-callback chat streaming chunks

c49b754

Co-authored-by: Yassin Kortam <yassin@berri.ai>

Refactor streaming generator for lint

d50462c

Co-authored-by: Yassin Kortam <yassin@berri.ai>

Handle bytes SSE chunks in streaming tests

9595d6d

Co-authored-by: Yassin Kortam <yassin@berri.ai>

yassin-berriai closed this May 13, 2026

yassin-berriai wants to merge 13 commits into litellm_internal_staging from cursor/chat-completions-perf-dfa0

4 participants
