Skip to content

[Infra] Promote internal staging to main#28100

Merged
yuneng-berri merged 82 commits into
mainfrom
litellm_internal_staging
May 17, 2026
Merged

[Infra] Promote internal staging to main#28100
yuneng-berri merged 82 commits into
mainfrom
litellm_internal_staging

Conversation

@yuneng-berri

Copy link
Copy Markdown
Collaborator

Relevant issues

Linear ticket

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have Added testing in the tests/test_litellm/ directory, Adding at least 1 test is a hard requirement - see details
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

CI status guideline:

  • 50-55 passing tests: main is stable with minor issues.
  • 45-49 passing tests: acceptable but needs attention
  • <= 40 passing tests: unstable; be careful with your merges and assess the risk.
  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Screenshots / Proof of Fix

Type

🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test

Changes

cursoragent and others added 30 commits May 13, 2026 00:31
…eaks

Convert the per-test VCR verdict line from a single 'NOOP / HIT / MISS /
PARTIAL' tag into a classified outcome that distinguishes the cases that
silently bill the live API on every CI run from the ones that don't:

  HIT                         pure replay
  PARTIAL                     mixed replay + new recordings
  MISS:RECORDED               new cassette saved to Redis (cached next run)
  MISS:OVERFLOW               cassette > MAX_EPISODES_PER_CASSETTE; persister
                              refused to save; re-bills every run
  MISS:NOT_PERSISTED          test failed; save_cassette skipped; re-bills
  NOOP                        VCR-marked but no HTTP traffic (mocked elsewhere)
  UNMARKED:LIVE_CALL          test bypassed VCR AND opened a TCP connection
                              to a known LLM provider host -> wasted spend
  UNMARKED:NO_TRAFFIC         test bypassed VCR but didn't call out

The UNMARKED:LIVE_CALL signal is what converts 'this test probably hits
live' into 'this test connected to api.openai.com'. We install a
socket.connect / socket.create_connection wrapper for the duration of
each non-VCR-marked test and record any outbound TCP to a known LLM
provider hostname. The probe sits below the httpx layer so vcrpy and
respx (which both patch above the socket) are unaffected.

Replace the file-level _RESPX_CONFLICTING_FILES blacklists in the
llm_translation and local_testing conftests with per-item respx
detection in apply_vcr_auto_marker_to_items. A test now skips VCR when
it actually carries @pytest.mark.respx or has respx_mock in its fixture
chain - not just because some other test in the same file imports
MockRouter. Items skipped by skip_files are split into respx_conflict
(real conflict, the module wires up respx) vs file_opt_out (dead skip-
list entry whose module never touches respx) so the session summary
makes pruning obvious.

Stabilize the AWS SigV4 fingerprint: the Authorization header on
Bedrock requests rotates its Credential date and Signature on every
call, which previously pushed every Bedrock test past the 50-episode
overflow threshold. Extract the access-key id only
('aws-sigv4:AKIA...') so two requests with the same identity match.

Always emit verdict logging when VCR is active (set
LITELLM_VCR_VERBOSE=0 to opt back into the legacy quiet mode). Add a
session-end classification summary that lists overflow tests, unmarked
live-call tests, and the skip-reason breakdown.

Wire the live-call probe + summary hook into every test directory that
already uses the Redis-backed VCR cache (audio_tests, guardrails_tests,
image_gen_tests, litellm_utils_tests, llm_responses_api_testing,
llm_translation, local_testing, logging_callback_tests, ocr_tests,
pass_through_unit_tests, router_unit_tests, search_tests,
unified_google_tests).

Add tests/llm_translation/test_vcr_classification.py covering the
verdict classifier, skip-reason tagging, AWS SigV4 fingerprint stability,
live-host classification, and session summary rendering.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
These seven test files were on _RESPX_CONFLICTING_FILES, which made the
auto-marker skip them entirely. Inspecting the source shows the only
respx artifact is a top-level 'from respx import MockRouter' that no
test ever uses - no @pytest.mark.respx, no respx_mock fixture, no
respx.mock context manager. The import is dead code left over from a
previous mocking pattern.

Now that apply_vcr_auto_marker_to_items detects respx per-item via the
marker / fixture chain (b637d9f), the file-level skip is no longer
needed for these files - they were the reason the OpenAI tests
(test_o3_reasoning_effort, test_streaming_response[o1/o3-mini],
TestOpenAIO1::test_streaming, TestOpenAIChatCompletion::test_web_search,
TestOpenAIO3::test_web_search, etc.) ran live every CI build despite
the cassette cache being healthy.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
…en module-level file handles

Module-level

    TEST_IMAGES = [
        open(os.path.join(pwd, 'ishaan_github.png'), 'rb'),
        open(os.path.join(pwd, 'litellm_site.png'), 'rb'),
    ]
    SINGLE_TEST_IMAGE = open(...)

opens the file once at import. After the first multipart upload, the
file pointer is at EOF, so every subsequent test in the same xdist
worker sends an empty multipart body. That non-determinism (a) blows
the recorded cassette past MAX_EPISODES_PER_CASSETTE (50) so
_RedisPersister.save_cassette refuses to save it, and (b) re-bills the
live image edit endpoint on every CI run.

Recent CI runs confirm the leak: tests/image_gen_tests/test_image_edits.py
shows six tests parking at 51-52 cassette entries
(TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False],
TestOpenAIImageEditDallE2::..., test_openai_image_edit_with_bytesio,
test_openai_image_edit_litellm_router, test_multiple_vs_single_image_edit[False],
test_multiple_image_edit_with_different_formats).

Replace the module-level file handles with _make_test_images() /
_make_single_test_image() factories that return fresh _RewindableImage
(BytesIO subclass) objects whose pointer always starts at 0. The image
bytes are read once at import into module-level constants
(_ISHAAN_GITHUB_BYTES, _LITELLM_SITE_BYTES), so disk I/O cost is
unchanged.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
The suffix '.bedrock-runtime.amazonaws.com' never matched real Bedrock
endpoints, which use the format 'bedrock-runtime[-fips].{region}.amazonaws.com'
(region between 'bedrock-runtime' and 'amazonaws.com'). Add an explicit
host check for that pattern so Bedrock live calls are visible to the
probe, and update the unit test accordingly. Also drop the unused
'_LIVE_CALL_PROBE_INSTALLED' module variable.
… upload

The _RewindableImage(BytesIO) wrapper auto-rewound on every read after
EOF, which made the OpenAI SDK's multipart upload writer read the same
bytes forever instead of seeing EOF. Workers OOM'd / SIGKILL'd:

    [gw0] node down: Not properly terminated
    replacing crashed worker gw0
    ...
    worker 'gw1' crashed while running
        'tests/image_gen_tests/test_image_edits.py::TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False]'

The auto-rewind was added defensively for parametrized + flaky-retried
tests, but BaseLLMImageEditTest::test_openai_image_edit_litellm_sdk
already calls get_base_image_edit_call_args() once per invocation and
that helper now constructs fresh streams via _make_test_images(), so
rewinding inside the stream is unnecessary. Replace with plain BytesIO
seeded with the cached image bytes.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
The pass_through prompt-caching tests
(test_prompt_caching_returns_cache_read_tokens_on_second_call,
test_prompt_caching_streaming_second_call_returns_cache_read) make a
warm-up call and then assert the *second* call sees a non-zero
cache_read_input_tokens count from the upstream's prompt-cache. VCR
replay can't model cross-call provider state — both calls match the
same cassette episode, so the second call returns the first call's
pre-warmup response and the assertion fails:

    AssertionError: Expected cache_read_input_tokens > 0 on second call,
    but got 0. Full usage: {'input_tokens': 4986,
    'cache_creation_input_tokens': 4974, 'cache_read_input_tokens': 0}

This started biting after the AWS SigV4 fingerprint stabilization
(b637d9f): Bedrock requests now produce a stable per-access-key
fingerprint instead of a per-request signature, so cassettes
successfully replay where they previously always missed and re-recorded
live. Opt these tests out via skip_nodeid_suffixes so they run live and
match the existing pattern in tests/llm_translation/conftest.py
(::test_prompt_caching).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
… to AST

Address two greptile P2 review concerns on PR #27795:

1. MISS:OVERFLOW was firing whenever total > MAX_EPISODES_PER_CASSETTE
   regardless of cassette state. A cassette that grew past the cap
   historically but this run only *replayed* (dirty=False) is
   healthy — the persister never tries to save, so the cache state is
   stable and the next run will replay too. Only flag OVERFLOW when
   dirty=True (new episodes were recorded that the persister would
   refuse to save). Add a regression test covering the
   dirty=False + large-total case.

2. _module_uses_respx did substring matching on the module source,
   which false-positives on comments / docstrings / string literals.
   A comment like # Previously tried respx.mock but switched to
   vcrpy would keep a file pinned on the opt-out list, defeating the
   dead-import pruning goal of this PR. Replace the substring scan
   with an ast.NodeVisitor (_RespxUsageVisitor) that only
   counts:

     - @pytest.mark.respx / @respx.mock decorators
     - with respx.mock(): ... (sync + async) context managers
     - respx.mock(...) calls outside a with/decorator
     - function parameters / fixture names equal to respx_mock

   Add tests for the comment / docstring / string-literal cases plus
   each real-usage pattern.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
…mary actually renders under xdist

`_session_stats` is a module-level dict mutated inside `_vcr_outcome_gate`
— which runs in each xdist worker process. The controller's
`pytest_terminal_summary` then reads its own empty `_session_stats` and
bails on `if not counts: return`, so the OVERFLOW / LIVE_CALL sections
the rest of this PR adds never make it into CI logs in the dist mode CI
actually uses.

Ship a structured `vcr_outcome` payload via `user_properties` (which
xdist round-trips) and add `aggregate_report_outcome` on the controller
to fold worker outcomes into `_session_stats`. The recording process
tags `vcr_recorded_by` with `PYTEST_XDIST_WORKER` so the controller can
tell "single-process — already counted locally" apart from "produced by
a worker — needs aggregation here", and not double-count when there's
no xdist.

Covered by 9 new unit tests in test_vcr_classification.py including the
end-to-end summary render path.
…itellm_vcr-cache-observability-and-fixes-c5bc
* fix: block NaN/Inf budget bypass and add missing non-admin guards

Addresses three security issues:

GHSA-wvg4-6222-3q4r: /user/update exposes max_budget, soft_budget, spend
to self-editing non-admin users with no server-side guard. Non-admin callers
now receive HTTP 403 if any of those fields appear in the update payload.

GHSA-q775-qw9r-2r4g: _enforce_upperbound_key_params returned early (no-op)
when upperbound_key_generate_params was absent from config, letting any
authenticated user generate a key with unlimited max_budget. Fix adds a
delegated-authority ceiling in _common_key_generation_helper: non-admins
cannot grant a key more budget than their own key carries.

GHSA-2rv4-xv66-fpjg: float('nan') passes every `value < 0` guard because
nan < 0 is False in Python, and spend >= nan is always False, permanently
disabling budget enforcement for any entity carrying a NaN max_budget.
All write-time budget guards now use `not math.isfinite(v) or v < 0`.
_enforce_upperbound_key_params validates finiteness unconditionally (before
the early-return). All spend-enforcement comparisons in auth_checks.py are
now guarded with math.isfinite(max_budget) as defense-in-depth.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: close budget ceiling bypass for callers with no max_budget (GHSA-q775)

Non-admin callers whose API key has no explicit max_budget (None) could
bypass the delegated-authority ceiling and create keys with arbitrary
budgets. Now blocks budget assignment when caller has no budget configured.
Also removes redundant inline import of LitellmUserRoles.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: only apply budget ceiling to explicitly requested max_budget

Capture the caller-supplied max_budget before _enforce_upperbound_key_params
can fill it with a default, so auto-filled defaults don't trigger the
ceiling guard for non-admin users with no budget on their own key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: capture requested max_budget before any defaults are applied

Move _requested_max_budget capture before both default_key_generate_params
and upperbound_key_generate_params mutations, so auto-filled values don't
trigger the ceiling check for non-admin users.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: allow unlimited-budget callers to delegate any budget

Callers with max_budget=None (unlimited) can legitimately create
budget-capped keys. Only block when caller has an explicit budget
and the requested budget exceeds it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* feat(lasso): extend LassoGuardrail to support tool calling (RND-5748)

* fix(lasso): PR review followups for tool-calling guardrail (RND-5748)

* fix(lasso): handle object-style tool_calls in _update_tool_calls_from_masked (RND-5748)

* fix(lasso): use model role for tool_use blocks (RND-5748)

* test(lasso): add round-trip tests for message transformation (RND-5748)

* fix(lasso): remove unused imports, handle Responses-API input masking, flatten multimodal content (RND-5748)

* fix(lasso): inspect Responses-API input field (RND-5748)

* fix(lasso): guard text-cursor remap against Lasso count mismatch (RND-5748)

* fix(lasso): flatten list content in tool_result.content (RND-5748)

* fix(lasso): remap multimodal list content during masking (RND-5748)

Bug: _map_masked_messages_back counted list-content messages in
original_text_count but the remap loop only handled isinstance(str).
The positional text_cursor never advanced for list messages, causing
all subsequent masked texts to be written onto the wrong messages.

Fix: added elif isinstance(content, list) branch that replaces the
list with the masked text string and advances the cursor — mirrors
the existing string-content branch. Also handles the assistant +
tool_calls combo for list-content messages.

Test: test_map_masked_messages_back_list_content verifies a user
message with [text + image_url] followed by an assistant message
gets correct masked content on both (cursor stays aligned).

* refactor(lasso): extract _get_field and _extract_tool_call_fields helpers (RND-5748)

The dict-vs-object access pattern (x.get('y') if isinstance(x, dict)
else getattr(x, 'y', None)) was duplicated 14 times across 5 methods.

_get_field(obj, field) — single-point dict/Pydantic field access.
_extract_tool_call_fields(call) — returns (call_id, name, parsed_input)
with JSON argument parsing, replacing ~30 duplicate lines in both
async_post_call_success_hook and _expand_messages_for_classification.

Also simplified _update_tool_calls_from_masked, _prepare_payload tool
mapping, and _apply_masking_to_model_response call_id extraction.

Net ~60 lines removed. No behavior change — all 32 tests pass.

* fix(lasso): add count guard to _apply_masking_to_model_response (RND-5748)

_apply_masking_to_model_response used a bare text_cursor without
verifying 1:1 correspondence between text-bearing choices and masked
text entries. If Lasso returned a different number of text messages
than choices with content, masked text would be applied to the wrong
choice or silently skip choices.

Added the same count-mismatch guard pattern already used in
_map_masked_messages_back: count original text-bearing choices,
compare to masked_text length, skip text remap on mismatch with a
warning log. Tool_call masking via id-based lookup is unaffected.

Tests:
- test_apply_masking_to_model_response_multiple_choices: verifies
  correct per-choice masked text with 2 choices
- test_apply_masking_to_model_response_count_mismatch: verifies
  content is left unchanged when counts disagree

* fix(lasso): close two guardrail-bypass paths flagged in review (RND-5748)

* tool-call args: when function.arguments is malformed JSON or parses
  to a non-object, preserve the raw string as {"arguments": <raw>} so
  Lasso still inspects it instead of receiving input=None. Covers both
  pre-call and post-call extraction (shared helper). Also resolves the
  CodeQL empty-except warning since the except body now assigns parsed=None.
* Responses-API input: when a request carries both "messages" and
  "input", inspect both. Previously a benign messages array let the
  guardrail skip data["input"] entirely. The masking write-back is
  split via a count boundary so masked messages flow back to
  data["messages"] and masked input flows back to data["input"]
  without cross-contamination.

Tests: malformed/non-object args round-trip, dual-field classification,
dual-field masking write-back split.

* chore(lasso): black formatting + comment on expand skip branch (RND-5748)

* black: wrap two long expressions in lasso.py and reformat dict
  literals in test_lasso.py to satisfy CI lint.
* add a short comment in _expand_messages_for_classification
  explaining why empty string and None content are intentionally
  skipped (None is the OpenAI shape for a pure tool-call turn).

* fix(lasso): satisfy mypy in _handle_masking, _update_tool_calls_from_masked, _apply_masking_to_model_response (RND-5748)

* Narrow `response.get("messages")` into a local before slicing so
  mypy doesn't see `Optional[List[Dict[str, str]]]` as non-indexable.
* Rename the two write-side `func` bindings in
  `_update_tool_calls_from_masked` to `func_dict` / `func_obj` so
  mypy doesn't unify the dict and Any|None branches.
* Rename the inner loop variable in `_apply_masking_to_model_response`
  from `msg` to `masked_msg` to avoid clashing with the
  `msg = choice.message` rebinding below.

No behavior change; resolves the 7 mypy errors from the CI lint job.
- Introduce `_CallbackCapabilities` dataclass and `ProxyLogging._callback_capabilities()` static method that inspects `litellm.callbacks` once and caches capability flags keyed on (list length, member ids); invalidates automatically when the callback list mutates without per-request iteration overhead
- Replace O(n) `litellm.callbacks` walks in `async_pre_call_hook`, `during_call_hook`, `async_post_call_streaming_iterator_hook`, `async_post_call_streaming_hook`, and `post_call_response_headers_hook` with fast-path exits when no relevant callbacks are registered
- Add `needs_iterator_wrap()` and `needs_per_chunk_streaming_hook()` instance methods to decouple iterator-level wrapping from per-chunk hook execution; avoids `get_response_string` materialization per chunk when no guardrail or chunk-hook callback is active
- Introduce `_fast_serialize_simple_model_response_stream()` using `orjson` for common single-choice text streaming chunks, bypassing the full Pydantic serializer; falls back to `model_dump_json` for tool calls, logprobs, usage, and provider-specific fields
- Add early-return in `_restamp_streaming_chunk_model` when downstream model already matches the requested model, avoiding unnecessary string comparisons on every chunk
- Fix stale zero-cost cache bug in `_is_model_cost_zero`: move the per-router `_zero_cost_cache` dict onto the `Router` instance and clear it in `_invalidate_model_group_info_cache` so in-place pricing updates via `upsert_deployment` immediately resume budget enforcement
- Add `scripts/benchmark_chat_completions_perf.py`: standalone async benchmarking tool with a mock OpenAI provider, LiteLLM proxy process management, non-streaming RPS, streaming TTFT, and full-stream latency measurements with repeat/median run support
- Add comprehensive unit tests covering capability detection, cache invalidation, fast-path correctness, zero-cost cache regression, and the no-callback streaming fast path

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
)

The mutation-test workflow timed out at the 350-minute job cap when
running whole-folder mutation against litellm/proxy/management_endpoints/
(~30 files, ~1.5 MB of source). Every mutant was running the full
test suite, and mutants were generated for lines no test covers — which
would survive regardless, just wasting compute.

mutmut 3.x's mutate_only_covered_lines setting runs the suite once up
front to compute coverage, then skips mutating uncovered lines. This
cuts the mutant count dramatically and is the right semantic for the
score (no test → no kill possible → uncountable). Per-mutant test
filtering by function name is already automatic in mutmut 3.x; no
external coverage step is needed.
…der body (#27913)

* fix(rate-limit): stop v3 limiter from leaking internal stash to provider body

PR #27001 (atomic TPM rate limit) introduced a reservation flow that
writes four LiteLLM-internal keys onto the request data dict:

  _litellm_rate_limit_descriptors
  _litellm_tpm_reserved_tokens
  _litellm_tpm_reserved_model
  _litellm_tpm_reserved_scopes
  _litellm_tpm_reservation_released

These keys are forwarded as request body params to the upstream provider,
which rejects them as unknown fields:

  OpenAI    -> 400 'Unknown parameter: _litellm_rate_limit_descriptors'
              (mapped by litellm to RateLimitError / 429, hiding the bug
               behind a misleading 'throttling_error' code)
  Anthropic -> 400 '_litellm_rate_limit_descriptors: Extra inputs are
               not permitted'

Net effect: every chat completion against any real provider fails the
moment a virtual key has any tpm_limit / rpm_limit set — i.e. v3-enforced
key-level TPM/RPM limits are broken end-to-end. The v3 RPM/TPM check
itself still runs (raises 429 on over-limit), but the success path
poisons the upstream body.

Reproduced on litellm_internal_staging HEAD (410ce76) against
gpt-4o-mini and claude-haiku-4-5 with a 1-RPM/1-TPM key — first request
fails with the provider's unknown-field error.

Fix: the stash is metadata only.

  - Add RATE_LIMIT_DESCRIPTORS_KEY constant and a _LITELLM_STASH_KEYS
    registry so we have a single source of truth for stash keys.
  - New helper _stash_value_in_metadata_channels writes to
    data['metadata'] / data['litellm_metadata'] without touching the
    top level.
  - _stash_reservation_in_data and the descriptor stash now route
    through that helper. _mark_reservation_released stops writing
    top-level.
  - _lookup_stashed_value also checks kwargs['metadata'] /
    kwargs['litellm_metadata'] (raw request_data shape) in addition to
    kwargs['litellm_params']['metadata'] (completion kwargs shape).
  - async_post_call_failure_hook now reads descriptors via the unified
    metadata lookup instead of request_data.get(top-level).
  - Defense in depth: async_pre_call_hook strips any stash key that
    somehow surfaced at the top level (stale cache, future refactor,
    test fixture) before returning.

Tests:
  - New regression test asserts no _litellm_* stash key is present at
    the top level of data after async_pre_call_hook, and that the
    metadata channel still carries the reservation + descriptors so
    success / failure reconciliation works.
  - Existing test_tpm_concurrent.py tests that asserted top-level
    presence are updated to read from data['metadata'] — the location
    is an implementation detail; the spec is that post-call callbacks
    can resolve the stash.

Verified end-to-end against OpenAI gpt-4o-mini and Anthropic
claude-haiku-4-5 via /v1/chat/completions on a low-rpm key:

  - With limits not exceeded: HTTP 200, valid completion response,
    no leaked fields in body.
  - With RPM exceeded: HTTP 429 from v3 enforcement
    ('Rate limit exceeded ... Limit type: requests').
  - With TPM exceeded: HTTP 429 from v3 enforcement
    ('Rate limit exceeded ... Limit type: tokens').

Full v3 hook test suite passes (171 tests).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* chore(rate-limit): use RATE_LIMIT_DESCRIPTORS_KEY constant in test, trim noisy comments

Address greptile P2: test fixture now uses the imported constant.
Drop comments that re-explain what well-named identifiers already convey.

* fix(rate-limit): reject caller-supplied stash values to prevent TPM-refund abuse

Strip _LITELLM_STASH_KEYS from data top-level and both metadata channels at
the start of async_pre_call_hook. Without this, an authenticated caller can
inject _litellm_rate_limit_descriptors plus _litellm_tpm_reserved_tokens in
body metadata, trigger a proxy-side rejection, and cause
async_post_call_failure_hook to refund TPM counters against attacker-named
scopes (e.g. another tenant's api_key).

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
* fix: allow for allowlisted redirect URIs

* github comment addressing

* Update litellm/proxy/_experimental/mcp_server/oauth_utils.py

Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com>

* harden oauth wildcard further

* test: cover wildcard entry with dot-leading suffix rejection

---------

Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com>
…de Desktop / Cowork citations) (#27886)

* feat(custom_logger): add async_post_agentic_loop_response_hook

Lets a CustomLogger shape the response returned by the agentic-loop
follow-up call without bypassing the loop's safety / observability
machinery (depth tracking, fingerprinting, etc.). Default returns the
response unchanged.

Used by websearch_interception to inject Anthropic-native
web_search_tool_result blocks when the originating client requested a
native web_search_* tool.

* feat(llm_http_handler): call post-agentic-loop hook on the originating callback

In _execute_anthropic_agentic_plan, after anthropic_messages.acreate
returns, call the originating callback's
async_post_agentic_loop_response_hook so it can mutate the final
response (e.g. inject native tool_result blocks). Pass the callback
through from _call_agentic_completion_hooks.

Exceptions in the post-hook are caught and logged so a buggy callback
can't kill the request.

* feat(websearch_interception): add is_anthropic_native_web_search_tool

Identifies tools the Anthropic-native clients (Claude Desktop, the
Anthropic SDK, the Anthropic Console) use to request native search:
type starts with "web_search_" (e.g. web_search_20250305). Rejects the
LiteLLM standard tool, the OpenAI-function variant, the bare
"WebSearch" legacy name, and the bare "web_search" Claude Code shape.

This lets us decide per-request whether the client expects
web_search_tool_result content blocks in the response, without
renaming any existing constants or touching native-provider skip
logic.

* feat(websearch_interception): add build_web_search_tool_result_block

Produces the Anthropic-native web_search_tool_result content block
from a structured SearchResponse. Anthropic-native clients use this
block to populate citations / source links — the existing text-blob
flatten path only feeds readable evidence to the model and discards
the structure, so this builder gives us the missing piece.

Shape matches https://docs.anthropic.com/en/api/web-search-tool —
web_search_result items carry url, title, page_age, encrypted_content
(empty string when the search provider doesn't supply one).

* feat(websearch_interception): emit native web_search_tool_result blocks

When the originating client request carried a native Anthropic
web_search_* tool, the final response now also carries
web_search_tool_result content blocks alongside the model's text
answer — so Claude Desktop / Anthropic SDK clients can populate the
citations panel and replay conversation history with structured search
evidence.

Wiring:
- Pre-request hooks (both deployment + Anthropic path) set a flag on
  kwargs when they see a native web_search_* tool, so the signal
  survives the conversion-to-litellm_web_search step regardless of
  which hook fires first.
- _execute_search now returns (text, SearchResponse) so the structured
  results aren't lost when the text is flattened for the follow-up
  model call.
- _build_anthropic_request_patch returns the parallel list of
  SearchResponse objects.
- async_build_agentic_loop_plan pre-builds the web_search_tool_result
  blocks (one per tool_use_id) and stashes them on plan.metadata when
  the flag is set.
- async_post_agentic_loop_response_hook reads the metadata and
  prepends the blocks to response.content.
- _execute_agentic_loop mirrors the injection for the legacy path so
  both paths behave identically.

Clients that send the LiteLLM standard tool keep the existing
text-only behavior — no regression.

* test(websearch_interception): cover native web_search_tool_result emission

18 tests across:
- detector branches (native vs litellm-standard, OpenAI-function shape,
  Claude Desktop builtin WebSearch, bare web_search, missing type)
- block-builder shape (results, none, empty)
- pre-request hook flag-setting (native sets, standard does not)
- async_build_agentic_loop_plan attaches blocks to plan.metadata when
  the flag is present, leaves metadata untouched when absent
- post-hook injection into dict and object responses
- legacy _execute_agentic_loop mirrors the injection so both paths
  return the same shape

* test(websearch_short_circuit): keep _execute_search mocks in sync with new tuple return

* test(websearch_thinking_constraint): keep _execute_search mocks in sync with new tuple return

* feat(websearch_interception): emit native blocks from try_short_circuit_search

The agentic-loop post-hook only fires when the model returns a tool_use
block. Cowork / Claude Desktop on Bedrock actually make TWO requests
per user turn: the main /v1/messages with their builtin tool, and a
separate standalone /v1/messages whose only tool is
web_search_20250305. That second request hits try_short_circuit_search
— no agentic loop, no post-hook — and was returning text-only, leaving
the citations panel empty.

When the short-circuit input carries a native web_search_* tool, build
a synthetic server_tool_use + web_search_tool_result pair (using the
structured SearchResponse already returned by _execute_search) so the
client gets the native shape it expects. The legacy text block is
preserved so non-native short-circuit callers (Claude Code,
github_copilot, etc.) see the same payload as before.

Failure path still emits the native block pair (with empty results)
plus the text-error block, so the client gets a well-formed response
rather than a malformed half-shape.

* test(websearch_native_blocks): cover short-circuit native-block emission

Three new cases on top of the existing 18:
- native web_search_20250305 short-circuit → [server_tool_use,
  web_search_tool_result, text], ids paired, urls/titles carried.
- litellm_web_search short-circuit → text-only (no regression).
- native short-circuit on search failure → still emits the native
  block pair (empty results) plus the text-error block, so the client
  never sees a malformed half-shape.

* test(websearch_short_circuit): index assertions by block type, not by position

Native short-circuit responses now have [server_tool_use,
web_search_tool_result, text] when the input carries
web_search_20250305 — find the text block by type rather than relying
on content[0].

* fix(websearch_interception): gate legacy WebSearch name on schema absence

Clients like Cowork / Claude Desktop ship a client-side tool named
"WebSearch" with a full input_schema — they handle it themselves and
expect to make a separate native web_search_20250305 sub-request for
the actual search.

Today is_web_search_tool matches the bare name regardless of other
fields, which hijacks the client's tool server-side. The agentic loop
fires on the main request, the model never gets to emit the
client-side tool_use, and the separate native sub-request (where
citation data flows) is never made. Net: citations panel empty.

Real Anthropic client tools always carry input_schema (the API rejects
them otherwise), so a bare {name: "WebSearch"} with no schema is the
only thing that could be a legacy interception marker. Gate the match
on schema absence: legacy callers (if any) keep working, real
client-side WebSearch tools pass through untouched.

* fix(websearch_interception): drop "WebSearch" from response-detection lists

Post-conversion the model always sees ``litellm_web_search``, so the
"WebSearch" entry in the response-side tool_use detection lists was
dead at best. If a model ever did return ``tool_use(name="WebSearch")``
it would now (incorrectly) hijack the client's own ``WebSearch`` tool
again — same Cowork problem we just fixed on the input side. Drop it.

* test(websearch_native_blocks): cover the WebSearch legacy-name schema gate

Three new cases:
- {name: "WebSearch"} (bare interception marker) → still matched
- {name: "WebSearch", input_schema: {...}} (Cowork client tool) →
  passes through untouched
- {name: "WebSearch", description: "..."} (no schema) → still matched
  on the assumption it's a legacy marker rather than a malformed real
  client tool.

---------

Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com>
pytest-cov runs with --cov=litellm, which makes coverage.xml store paths
relative to the package root (e.g. `proxy/proxy_server.py` instead of
`litellm/proxy/proxy_server.py`). Codecov auto-resolves these only when
the basename is unique in the repo. Files like proxy_server.py, router.py,
utils.py, main.py, and constants.py — which have duplicates under
enterprise/ or other subpackages — get silently dropped during ingest.

The `fixes: ["::litellm/"]` rule prepends `litellm/` to every uploaded
path so they resolve unambiguously. Confirmed against multiple recent
coverage.xml artifacts that no uploader currently emits paths already
prefixed with `litellm/`, so the rule is safe to apply universally.

This restores Codecov visibility for the highest-fix-rate hotspots:
proxy_server.py, router.py, proxy/utils.py, litellm_logging.py,
constants.py, key_management_endpoints.py, utils.py, main.py,
user_api_key_auth.py, team_endpoints.py, and litellm_pre_call_utils.py.
Audit of .github/workflows/ via gh run history shows the following have
either never run or have been dormant for 10+ weeks. CI coverage that
still matters is preserved on CircleCI (e.g. llm_translation_testing).

Removed workflows:
- test-litellm.yml — workflow_dispatch only, last run 2026-02-12 (cancelled);
  CCI local_testing_part1/2 covers the same tests
- llm-translation-testing.yml — last run 2025-07-10; replaced by CCI
  llm_translation_testing job (run_llm_translation_tests.py kept for the
  make test-llm-translation target)
- run_observatory_tests.yml — last run 2026-03-03 (cancelled)
- scan_duplicate_issues.yml — last run 2026-03-02 (failure)
- publish_to_pypi.yml — never run
- read_pyproject_version.yml — fires on every push to main but its echoed
  version output is not consumed by any downstream step

Removed orphan files (no callers in workflows, CCI, or Makefile):
- .github/workflows/README.md — documented only publish_to_pypi.yml
- .github/workflows/update_release.py + results_stats.csv
- .github/actions/helm-oci-chart-releaser/
This reverts commit e25a988.

The `fixes: ["::litellm/"]` rule turned out to be applied *after* Codecov's
auto-resolution, not before. Files with unique basenames (which were
auto-resolving correctly to `litellm/<path>`) got an extra `litellm/`
prepended, producing `litellm/litellm/<path>` storage. Files with
ambiguous basenames (the actual target of the fix) continued to be
dropped because the auto-resolution still failed for them.

Net result on the verification run: 1375 files now stored under
unresolvable `litellm/litellm/...` paths, and the 11 originally-missing
hotspots are still missing. Reverting before piling on further changes.
…y-and-fixes-c5bc

test(vcr): classify cache verdicts, surface cost leaks, and fix the two biggest leakers
…act vi.mock

Per-file `vi.mock("@tremor/react", ...)` factories fully replace the
setup-level mock from `tests/setupTests.ts`, so the global Button/Tooltip
overrides are lost in any file that re-mocks `@tremor/react`. Without
them, the real Tremor `<Button>` leaks through and its internal
`useTooltip(300)` schedules a native 300ms `setTimeout` on pointer
events. When the test environment is torn down before the timer fires,
the trailing `setState` calls `getCurrentEventPriority`, which reads
`window.event` against a destroyed jsdom -> "window is not defined"
flake observed on CI.

Patches the 7 leaky test files to re-supply `Button` (bare `<button>`)
and `Tooltip` (Fragment) overrides matching `setupTests.ts`. Also drops
a dead `afterEach` workaround in `user_edit_view.test.tsx` (the
fake-timer dance it ran could not drain a real timer scheduled before
the swap) and corrects a misleading comment in `MakeMCPPublicForm.test.tsx`.
…decov

pytest-cov treats --cov=<module-name> as a Python package and emits XML
paths relative to the package root, stripping the litellm/ prefix
(`proxy/proxy_server.py` instead of `litellm/proxy/proxy_server.py`).
Codecov's auto-prefix heuristic then drops every file whose basename is
ambiguous in the repo — `proxy_server.py` (3 copies under enterprise/),
`router.py` (2 copies), `utils.py` (20+), `main.py` (20+), `constants.py`
(2). The 11 highest-fix-rate hotspots have never appeared in Codecov.

Switching to --cov=./litellm treats the argument as a path, which makes
coverage.xml emit repo-relative paths (`litellm/proxy/proxy_server.py`).
Each path is unambiguous, so Codecov resolves all files correctly.

Verified locally: rerunning a single proxy_unit_tests test with
--cov=./litellm produced `filename="litellm/proxy/proxy_server.py"`,
`filename="litellm/router.py"`, and `filename="litellm/types/router.py"`
as distinct entries — exactly the disambiguation Codecov needs.

Touches every workflow that uploads coverage: the two reusable GHA
workflows (_test-unit-base.yml, _test-unit-services-base.yml),
test-mcp.yml, and all 14 invocations in .circleci/config.yml.
chore(ci): remove unused GitHub Actions workflows and orphan files
Remove available_on_public_internet gating from delegate-auth-to-upstream
paths so oauth2 + delegate_auth_to_upstream interactive servers behave
the same when marked internal. Keeps M2M exclusion. Updates tests.
test(ui): preserve global Button/Tooltip mocks in per-file @tremor/react vi.mock
Log verbose_logger.warning when loading oauth2 interactive servers with
available_on_public_internet=false and delegate_auth_to_upstream=true
(config + DB). Dashboard Alert for the same combo. CLAUDE note for
operators. Tests for log and M2M skip.
yuneng-berri and others added 23 commits May 15, 2026 22:34
Mock the image fetch instead of downloading a 50MB+ image from
upload.wikimedia.org. The runner was intermittently rate-limited
(HTTP 429), so the code raised "Unable to fetch image ... Status
code: 429" and the size-limit assertions failed even though
pytest.raises(litellm.ImageFetchError) still matched.

Mirror the established LargeImageClient pattern in
tests/test_litellm/litellm_core_utils/test_image_handling.py: stub
litellm.module_level_client with a response whose Content-Length
exceeds the 50MB limit and bypass SSRF validation, so the
size-limit rejection path is exercised deterministically with no
external network dependency.
The Content-Length header check in _process_image_response rejects the
image before the body is streamed, so the mock body never needs to be
materialized. Use an empty body instead of b"x" * 100MB (addresses
greptile/cursor review feedback).
test(gemini): de-flake test_gemini_image_size_limit_exceeded
… openapi field

Two CI failures, both pre-existing in different ways:

1. reasoning_effort_grid: all 33 bedrock_invoke_messages cells failed with
   AttributeError("module 'litellm' has no attribute 'messages'"). litellm
   exposes the async Anthropic Messages entrypoint as litellm.anthropic_messages
   (via "from .llms.anthropic.experimental_pass_through.messages.handler
   import *" in litellm/__init__.py), not litellm.messages.acreate. Swap
   the call.

2. tests/test_litellm/interactions/test_openapi_compliance.py::TestResponseCompliance::test_interaction_response_fields
   asserts the live Google spec contains "steps". Google's spec has churned
   through "outputs" -> "steps" -> neither, and presently carries neither.
   The test broke on main as soon as upstream dropped "steps"; pulling the
   key off the assert list realigns the test with the live schema. Re-add
   the per-turn output field once upstream stabilizes on a name.

The openapi-compliance fix doesn't belong to this PR conceptually but is
included here per request to unblock CI before the morning.
… not class

The anthropic_messages route wraps client-side BadRequestError as
AnthropicError (a BaseLLMException subclass) with status_code=400, so
"except BadRequestError" missed those cells and they fell through to the
generic Exception arm, returning 500 instead of the expected 400.

Replace the isinstance-on-BadRequestError check with a tiny classifier
that prefers BadRequestError membership, then falls back to the exception's
status_code attribute (set by every BaseLLMException subclass), then 500.
Apply to both _call_chat and _call_messages for consistency.

Fixes the 13 CircleCI llm_translation_testing failures on
bedrock_invoke_messages cells where the effort was disabled / invalid /
empty / xhigh-on-unsupported / max-on-unsupported.
Four pre-existing flakes on main that gate this branch's workflow even
though they're unrelated to the reasoning_effort_grid suite:

1. tests/local_testing/test_completion.py::test_completion_fireworks_ai
2. tests/local_testing/test_completion_cost.py::test_completion_cost_fireworks_ai[fireworks_ai/llama-v3p3-70b-instruct]
3. tests/llm_translation/test_fireworks_ai_translation.py::test_document_inlining_example[False]

   The Fireworks-hosted `llama-v3p3-70b-instruct` deployment is currently
   returning 404 "Model not found, inaccessible, and/or not deployed".
   These tests pass when the model is deployed; the issue is upstream
   capacity, not our code path. Wrap the live call in a try/except that
   pytest.skip's on litellm.NotFoundError so a Fireworks deployment hiccup
   no longer fails CI for unrelated PRs.

4. tests/llm_translation/test_gemini.py::test_gemini_image_size_limit_exceeded

   The test fetches the 32MB "Blue Marble 2002" image from Wikimedia to
   exercise the 50MB image-size cap. CI runners share an IP pool with
   noisy traffic, so Wikimedia routinely returns HTTP 429. The size-limit
   check never gets a chance to fire. Catch the 429 BadRequestError and
   pytest.skip in that case.

None of these belong on this PR conceptually, but they're included per
request to unblock the workflow before morning.
…ageFetchError

litellm.ImageFetchError is a subclass of BadRequestError, so when
Wikimedia returns 429 the pytest.raises(ImageFetchError) block matches
and swallows the exception -- the outer try/except never fires. Drop the
try/except and check the captured error message for "Status code: 429"
after the raises block, calling pytest.skip in that case. Same intent,
right control flow.
…view

Two P2 nits flagged by Greptile on PR 28036:

1. _build_completion_kwargs() defaulted vertex_project to "vertex-check-481318"
   when VERTEX_PROJECT was unset. That value is a specific GCP project that
   doesn't belong to this repo, so if the env-var skip guard were ever
   bypassed (misconfig, direct helper call), the test would silently issue
   calls to a foreign project rather than failing loudly. Drop the fallback
   and read os.environ["VERTEX_PROJECT"] directly, mirroring how
   AZURE_FOUNDRY_* are handled.

2. _build_messages_kwargs() was a one-liner that returned the result of
   _build_completion_kwargs() unchanged -- a dead abstraction with one
   caller. Inline at the _call_messages call site and delete the helper.
…s-cZRwz

Resolve conflicts in the five unrelated CI-flake fixes I previously landed
on this branch -- staging shipped stronger versions (mocked HTTP for the
Fireworks tests, mocked image-fetch for the Gemini size-limit test, switched
the openapi-compliance test to the Interaction response schema instead of
dropping the assertion). Take staging's version of all five files and drop
my now-unreachable 429-skip lines from the Gemini test that the auto-merge
left behind.
…nd migrations (#27557)

Split the monolithic LiteLLM proxy into independently scalable Kubernetes components to allow separate horizontal scaling of the LLM data plane and management API surfaces

- Add DatabaseURLSettings pydantic-settings model that assembles DATABASE_URL (and optional DATABASE_URL_READ_REPLICA) from discrete DATABASE_* env vars before Prisma initializes, supporting both IAM token auth (minting short-lived RDS tokens) and password auth; replaces the CLI-only path that componentized entrypoints bypass
- Add gateway component (port 4000) that trims the proxy route table to the LLM data-plane surface (chat, embeddings, completions, audio, realtime, provider passthroughs, health/metrics) via an allowlist applied inside the lifespan context so plugin-registered routes are captured
- Add backend component (port 4001) that exposes the management/admin surface (keys, users, teams, orgs, spend analytics, model management, SSO, audit logs) with a complementary allowlist
- Add ui component — Next.js static export served by nginx (port 3000) with RSC payload routing, asset prefix aliasing, and SPA fallback for dashboard routes
- Add migrations component with dedicated Dockerfile that runs prisma migrate deploy via a Helm pre-install/pre-upgrade Job, eliminating per-pod schema contention on the Prisma advisory lock
- Add Helm chart (helm/litellm) with separate Deployments, Services, HPAs, and ConfigMap for each component; shared _helpers.tpl emits DATABASE_*, IAM_TOKEN_DB_AUTH, REDIS_*, and DISABLE_SCHEMA_UPDATE env vars from chart values; ingress template routes traffic to the correct component by path prefix
- Add comprehensive tests for DatabaseURLSettings covering IAM auth, password auth, read replica fallbacks, operator-pinned URL preservation, and percent-encoding; add coverage test asserting gateway + backend allowlist union equals the full proxy route set
- Add pydantic-settings>=2.14.1 as a proxy extra dependency and update liccheck allowlist

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
test(ci): add reasoning_effort grid e2e regression suite
…ps (#28028)

* fix(ci): flag codecov uploads and enable carryforward

Coverage uploads from GHA and CircleCI were unflagged. Commits that
receive the push-triggered workflows more than once (re-runs, or branches
cut at the same SHA) accumulated many overlapping flagless sessions, and
Codecov's per-commit merge dropped the largest, ubiquitously-imported
files (router.py, proxy_server.py, main.py, utils.py, cost_calculator.py)
from the report even though the uploaded XMLs contained them.

- codecov.yaml: flag_management.default_rules.carryforward: true
- GHA reusable bases: tag each upload with its workflow/shard name
- CircleCI: tag the combined upload "circleci"; also combine the
  agent / google_generate_content_endpoint / litellm_utils datafiles
  that were produced and required but missing from the combine list

* fix(ci): close coverage gaps in proxy-legacy, router-unit, auth-ui, caching-redis

- test-unit-proxy-legacy: route through _test-unit-base so the full
  proxy_unit_tests suite (incl. comprehensive test_proxy_server*.py) is
  measured and uploaded with per-group flags (was plain pytest, no --cov)
- _test-unit-services-base: declare the enable-redis input + the six
  secrets test-unit-caching-redis passes; that workflow had a workflow_call
  signature mismatch and startup_failed on every push (never ran).
  Changes are additive/optional - proxy-db and security callers unchanged
- circleci: add --cov + persist + combine + upload-coverage requires for
  litellm_router_unit_testing (tests/router_unit_tests) and
  auth_ui_unit_tests (tests/proxy_admin_ui_tests); neither was covered
  anywhere. Redundant -k subset jobs left as-is (local_testing covers them)

* fix(ci): remove dead GHA Redis workflow; keep Redis on CircleCI only

CircleCI redis_caching_unit_tests already runs the exact same files
(tests/local_testing/test_dual_cache.py, test_redis_batch_optimizations.py,
test_router_utils.py) with --cov, and that datafile is already combined
and uploaded. The GHA test-unit-caching-redis workflow was redundant and
had never run (workflow_call signature mismatch -> startup_failure on
every push).

- Delete .github/workflows/test-unit-caching-redis.yml
- Revert _test-unit-services-base.yml to the flag-fix state (drop the
  enable-redis input / secrets / env wiring added only to prop up the
  GHA Redis workflow); the verified per-upload flags line is kept
- The only single-star "litellm_*" branch glob lived in the deleted
  file; no other single-star globs exist, so none remain to widen

* fix(ci): keep proxy-legacy as a standalone job to preserve required check names

Routing proxy-legacy through the reusable workflow renamed each check from
the bare matrix name (e.g. "proxy-response-and-misc") to
"proxy-response-and-misc / Run tests". Those bare names are required status
checks in branch protection, so the old contexts never reported and PRs sat
"Expected — Waiting for status to be reported" indefinitely.

Restore the original standalone matrix job (job name == matrix name, so the
required contexts report again) and add coverage in place: --cov on pytest
plus an OIDC Codecov upload flagged proxy-legacy-<group>. Net effect of the
gap-#2 fix is preserved (flagged coverage for tests/proxy_unit_tests/**)
without changing any check name.

* revert(ci): drop all proxy-legacy changes from this PR

tests/proxy_unit_tests/** is already fully covered by test-unit-proxy-db
(its shard-coverage guard fails CI if any file in that dir is unassigned),
which this PR already flags + carryforwards. Adding --cov and id-token:write
to the legacy pull_request job was redundant and put OIDC on a job that runs
untrusted PR code. Restore the file to the base version verbatim so this PR
no longer touches proxy-legacy at all (also restores its original required
check names). Retiring proxy-legacy in favor of proxy-db on pull_request is
a separate effort that needs a branch-protection change.
… code, route/path, preprocessing latency) (#28040)

* feat(otel): expose http.response.status_code on failure spans

Set the OTel-standard http.response.status_code (integer) on failure
spans alongside the existing OpenInference error.code (kept for
back-compat). error.type is already emitted via ERROR_TYPE.

Crucially, also record structured error attributes on the proxy SERVER
span ('Received Proxy Server Request') from async_post_call_failure_hook
- the only place the SERVER span is in hand. _handle_failure records on
the litellm_request child span (the parent span is not propagated into
its kwargs), so prior to this change the SERVER span that dashboards
query carried only span status, never error.code/error.type. Reuses
_record_exception_on_span + StandardLoggingPayloadSetup.get_error_information
so values match the child span.

Tests: recorder unit coverage + a hook-driven test asserting the SERVER
span is stamped (the gap recorder-only tests missed). Full
test_opentelemetry.py suite: 197 passed.

* feat(otel): set http.route + url.path on the proxy SERVER span

Add the OTel-standard http.route (low-cardinality route template, e.g.
/v1/threads/{thread_id}/runs) and url.path (literal path) to the SERVER
span ('Received Proxy Server Request') so dashboards can group traffic
by endpoint instead of seeing every path param as a unique value.

Same architectural gap as the status-code commit: the success/failure
logging handlers write the litellm_request CHILD span, and
_handle_success explicitly refuses to copy to the SERVER span. Verified
with a console-exporter run that the SERVER span was bare on success.

Unlike error info, route/path are known at request time, so set them
directly on the freshly-created SERVER span in user_api_key_auth (one
edit point, works for success and failure, no hook-ordering risk):
- http.route from the matched FastAPI route (scope['route'].path),
  empirically confirmed populated at auth-dependency time.
- url.path from the existing literal-path variable.
New get_request_route_template helper + set_proxy_request_route_attributes
(no-op on None span, so the Langfuse override stays safe).

Tests: route-attribute setter + route-template helper edges. Full
test_opentelemetry.py and test_auth_utils.py green.

* feat(otel): set litellm.preprocessing.duration_ms on the proxy SERVER span

Expose the total time LiteLLM spends before the upstream provider
request begins (auth + parsing + pre-call hooks) as a single number on
the SERVER span ('Received Proxy Server Request'). Window:
proxy-receive -> FIRST provider handoff.

Retry semantics: first attempt only (pure preprocessing, excludes
retry loops + backoff). api_call_start_time is overwritten on every
attempt, so a set-once first_api_call_start_time pins the first handoff.

Same architectural gap as the prior two commits: the success/failure
logging handlers write the litellm_request CHILD span, not the SERVER
span. Set it instead from the post-call hooks on
user_api_key_dict.parent_otel_span.

Failure-path subtlety: request_data.pop('litellm_logging_obj') runs
before the failure-hook loop, so the failure hook can't read the
logging object. litellm_received_at is propagated via the existing
request->metadata channel, and first_api_call_start_time is mirrored
onto litellm_params.metadata, so both anchors survive into request_data
and the OTel helper reads them uniformly for success and failure.

Edits: user_api_key_auth (stash receive instant), litellm_pre_call_utils
(propagate it), litellm_logging (set-once first handoff + metadata
mirror), opentelemetry (constant + set_preprocessing_duration_attribute,
called from both post-call hooks).

Tests: duration helper (both container shapes, missing/negative/None
edges) + set-once invariant (retry doesn't overwrite, metadata mirror).
test_opentelemetry.py + test_auth_utils.py + test_litellm_logging.py:
447 passed. Verified live: SERVER span carries the attribute on success
and failure, coexisting with the status-code and route attributes.

* fix(otel): MyPy type-narrowing for status-code + preprocessing-duration

No behavior change. MyPy (CI lint) flagged:
- error_information["error_code"] is str|None: narrow via a None-checked
  local before int().
- _to_timestamp returns Optional[float]: resolve both anchors and return
  early if either is None instead of subtracting possibly-None floats.

* fix(otel): stop polluting user request metadata with first_api_call_start_time

The PR3 set-once preprocessing anchor was mirrored into
litellm_params["metadata"] from core litellm_logging.py. That dict is
the caller's request metadata, mutated in place and shared across every
call path including pure SDK (litellm.acreate_batch). It got echoed into
LiteLLMBatch(metadata=...), which the OpenAI batch schema types as
Dict[str, str] -> pydantic ValidationError on a datetime value.

- litellm_logging.py: set first_api_call_start_time only on
  model_call_details (success path reads it there directly).
- proxy/utils.py: post_call_failure_hook lifts it off the logging object
  into request_data (internal top-level key, same convention as the
  other proxy-internal request_data keys) right before the existing
  litellm_logging_obj pop. Never touches user metadata.
- opentelemetry.py: read the anchor from the container top level
  (model_call_details on success, request_data on failure).
- Tests updated; add TestPostCallFailureHookLiftsFirstApiCallStartTime.

Fixes the batches_testing regression introduced on this branch.

* chore(otel): trim verbose comments to concise rationale

Collapse multi-line why-blocks to one or two lines and drop process/plan references (PR-numbering, "the plan") from test comments. No behavior change.
openai 2.34.0 began rejecting an explicitly-passed empty-string api_key
at client construction (raises OpenAIError before any request), which
broke tests/local_testing/test_exceptions.py::test_exception_with_headers
and related cases after uv.lock floated openai 2.33.0 -> 2.36.0.

Pin back to 2.33.0 (within the existing pyproject >=2.20.0,<3.0.0 range)
as a temporary stopgap; longer-term fix to follow.
)

* feat(model_catalog): add Azure AI Foundry GPT-5.4 model metadata

Register azure_ai GPT-5.4 variants with pricing, context limits from
Foundry catalog, and capability flags for cost routing and tooling.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(model_catalog): tighten Azure AI GPT-5.4 cost and capability metadata

Add supports_web_search for base GPT-5.4 aliases, priority-tier Pro rates,
and mini/nano above-272k plus priority pricing for correct spend math.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(model_catalog): sync web_search flag on Azure AI GPT-5.4 dated backup row

Mirror supports_web_search for azure_ai/gpt-5.4-2026-03-05 in the backup
catalog so it matches model_prices_and_context_window.json.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
…28090)

The proxy SERVER span ("Received Proxy Server Request") only carried
http.response.status_code on failures (set in _record_exception_on_span),
so success traces had no 2xx bucket — error-ratio and status-breakdown
dashboards were missing their denominator and the span violated the HTTP
semconv (the attribute is required whenever a response is sent). Add a
set_response_status_code_attribute helper and call it from
async_post_call_success_hook with 200, symmetric with the failure path
and the existing route/preprocessing-duration SERVER-span attributes.
…fo (#28079)

* fix(proxy): sort BYOK models by team_public_model_name in /v2/model/info

Team BYOK rows persist an internal `model_name` like
`model_name_{team_id}_{uuid}` and expose the user-facing name via
`model_info.team_public_model_name`. The UI's `getDisplayModelName`
and the search filter already fall back to that field, but
`_sort_models` was keying off the raw `model_name` — so BYOK rows
ranked by their opaque IDs and clumped at the end of the alphabetized
list instead of interleaving with non-BYOK rows.

Match the UI/search behavior: prefer `team_public_model_name` when
present, fall back to `model_name` otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(proxy): case-insensitive DB-side search for BYOK models

`_apply_search_filter_to_models` used Prisma's JSON path
`string_contains` to match the BYOK `team_public_model_name` field, but
that operator is case-sensitive in Postgres (no `mode: insensitive`
flag like column-level string filters have). So a search for "claude"
missed a stored "Claude Sonnet" via the DB branch even though the
router-side path matched it case-insensitively.

Widen the JSON branch to "row has a team_public_model_name set" and
filter case-insensitively in Python so DB-only BYOK rows match the
same terms users see in the UI. This also drops the now-unused
DB-level page-size optimization and `sort_by` knob — the in-Python
filter is the source of truth for `db_models_total_count` now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(proxy): scope BYOK search results to caller's accessible teams

`_apply_search_filter_to_models` was widened to fetch every row with a
`team_public_model_name` set so case-insensitive search could match
mixed-case stored names. `/v2/model/info` is reachable by non-admin
keys though, and the helper ran before `include_team_models` / `teamId`
filtering — so a non-admin caller could search a common substring like
"claude" and see BYOK rows belonging to teams they're not a member of.

Resolve the caller's team membership once (admin → no scoping, else
their `user_row.teams`) and drop BYOK rows (those with
`model_info.team_id` set) outside that scope on both the router-side
matches and the over-broad DB query, before display-name matching.
Non-team rows are unaffected and remain gated by the existing
`include_team_models` / `direct_access` paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(proxy): search by team_public_model_name and scope teamId queries

- /v2/model/info search now matches both `model_name` and
  `model_info.team_public_model_name`, so team BYOK rows (which persist
  an internal `model_name_{team_id}_{uuid}`) are findable by the public
  name shown in the UI. DB query OR-includes a JSON-path match on
  `team_public_model_name` for rows that exist only in the DB.
- `_filter_models_by_team_id` no longer short-circuits on the viewer's
  `direct_access` flag — that describes the admin viewer's own
  permissions and would leak every public model into a team-scoped view.
  Models are kept only when they belong to the team (own BYOK, in
  access_via_team_ids, or reachable via team.models / access groups).
- Added `_authorize_team_id_query`: the untrusted `teamId` query
  parameter now requires the caller to be a proxy admin or a member of
  the requested team, otherwise returns 403. Without this, any
  authenticated user could enumerate another team's BYOK metadata by
  guessing the team id.
- `_get_caller_byok_team_scope` now treats `PROXY_ADMIN_VIEW_ONLY` the
  same as `PROXY_ADMIN` (both are admin roles); previously VIEW_ONLY
  admins fell through to a user-id team lookup and saw only their own
  teams' BYOK rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(proxy): bound BYOK search DB fetch in /v2/model/info

Previously the DB-side search OR'd a JSON-path predicate
`{model_info: {path: [team_public_model_name], string_contains: ""}}`
to compensate for Prisma's case-sensitive JSON `string_contains` on
Postgres. That predicate matches every row that has any
`team_public_model_name` set, so any authenticated caller could force a
full BYOK-table read with `/v2/model/info?search=x` regardless of page
size.

Drop the JSON-path branch. The DB query now does a bounded
`model_name contains <search>` lookup. BYOK rows that are loaded into
the router are still searchable by their `team_public_model_name` via
the router-side filter; only the rare edge case of a BYOK row that
exists only in the DB (router sync failed) loses display-name search,
which is an acceptable trade-off given the DoS surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(proxy): bound DB find_many in /v2/model/info search

The previous bounding patch dropped the page-aware `take=N` on
`find_many`, so a broad `?search=model` would load and decrypt every
matching DB row on each request even though the response only returns
one page.

Restore bounded fetches in `_apply_search_filter_to_models`:

* Unsorted searches use `take = max(0, page * size - router_count)`,
  i.e. exactly one page worth of remaining DB rows.
* Sorted searches need ordering across the full match set, so they cap
  at `_SORTED_SEARCH_DB_FETCH_CAP = 500` instead of fetching everything.
* Total count comes from a cheap `count(...)` query so pagination stays
  accurate without materializing every row.

Wired `page`, `size`, and `sortBy` through from the endpoint and added
a regression test covering both `take` values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(proxy): extract DB-fetch helper to satisfy PLR0915

_apply_search_filter_to_models tripped Ruff's "too many statements"
(51 > 50) after the bounded-fetch fix. Move the DB-side block into
`_fetch_db_models_for_search`, which keeps the same behavior:

* Bounded `take` via page math (unsorted) or `_SORTED_SEARCH_DB_FETCH_CAP`
  (sorted)
* Cheap `count(...)` for accurate pagination totals
* Caller-team scope applied to fetched rows before decrypt

Pure refactor; no behavior change. All 8 BYOK/team tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: apply black formatting to _fetch_db_models_for_search

CI's "Check Black formatting" step flagged one line in the helper added
in d55eecf. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add AWS ECS Fargate stack with Aurora Postgres (IAM auth), ElastiCache Redis, S3, ALB with path-based routing to gateway/backend/ui components, Application Auto Scaling, and automated DB bootstrap + prisma migration via local-exec provisioners
- Add GCP Cloud Run stack with Cloud SQL Postgres (password auth), Memorystore Redis, GCS, external HTTPS load balancer with serverless NEGs and URL map routing, and automated prisma migration via Cloud Run Job
- Both stacks support typed proxy_config input mirroring the helm chart's gateway.config.proxy_config, per-component extra env vars, and Secret Manager references for provider API keys
- Gateway/backend services depend on terraform_data.migration so they never start before the schema is in place, eliminating crash-loop windows on first apply
- AWS stack uses IAM database authentication with a one-shot Fargate bootstrap task that creates and grants the rds_iam role to the application user; GCP stack uses password auth assembled at container startup to avoid Cloud SQL Auth Proxy sidecar complexity
- Add .gitignore rules for Terraform state files, plan files, tfvars inputs, provider binaries, and crash logs while explicitly keeping .terraform.lock.hcl for provider version pinning
- Include terraform.tfvars.example files, provider lock files, and comprehensive README documentation covering architecture, TLS setup, image pull strategies, and quick-start instructions for both stacks

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
…{"detail":"invalid_request"} (#28086)

* fix(mcp-oauth): add PROXY_BASE_URL escape hatch + diagnostic logging for invalid_request

Customers hitting "{"detail":"invalid_request"}" on the MCP /authorize
endpoint had no way to recover when their ingress mangles X-Forwarded-*
headers (the same-origin check in validate_trusted_redirect_uri compares
the browser-supplied redirect_uri against get_request_base_url, which is
reconstructed from those headers).

Two contained changes:

  1. get_request_base_url now honours PROXY_BASE_URL as the canonical
     public origin when set, bypassing the X-Forwarded-* trust gate
     entirely. Operators who know their public URL can set it once
     instead of debugging ingress header rewrites.

  2. The rejection path in validate_trusted_redirect_uri emits a WARN
     log carrying the redirect_uri, computed proxy base, and the
     X-Forwarded-* / Host headers seen. A bare 400 was undiagnosable;
     this turns it into a one-line root-cause.

* test(mcp-oauth): capture warnings from correct logger ("LiteLLM")

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp-oauth): reject malformed PROXY_BASE_URL with one-shot diagnostic

A scheme-less PROXY_BASE_URL (e.g. "litellm.example.com" instead of
"https://litellm.example.com") would sail through urlparse with empty
scheme + netloc, silently breaking every same-origin compare in
validate_trusted_redirect_uri and leaving the operator staring at the
same opaque 400 the env var was meant to fix.

Validate it once at read time: only honour values that parse as
http(s) URLs with a non-empty netloc; otherwise log a one-shot WARN
naming the bad value and fall through to the request-derived origin
so the proxy still serves traffic.

* fix(mcp/oauth): normalize PROXY_BASE_URL to strip query/fragment

Match the X-Forwarded-* path's normalization so a configured
PROXY_BASE_URL containing a query string or fragment does not break
downstream f-string concatenation like f"{base_url}/callback".

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* refactor(mcp-oauth): drop non-essential comments from PROXY_BASE_URL changes

Strip narrative comments and verbose docstrings added in this PR; the
code is intuitive enough on its own and the log messages already carry
their own diagnostic context. Pre-existing comments are left untouched.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
* bump: version 0.1.40 → 0.1.41

* bump: version 1.85.0 → 1.86.0

* add uv lock
@greptile-apps

greptile-apps Bot commented May 17, 2026

Copy link
Copy Markdown
Contributor

Too many files changed for review. (625 files found, 100 file limit)

@CLAassistant

CLAassistant commented May 17, 2026

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
9 out of 15 committers have signed the CLA.

✅ vladpolevoi
✅ ryan-crabbe-berri
✅ dennishenry
✅ mateo-berri
✅ yuneng-berri
✅ shivamrawat1
✅ Michael-RZ-Berri
✅ Sameerlite
✅ kenany
❌ claude
❌ krrish-berri-2
❌ yassin-berriai
❌ ishaan-berri
❌ shin-berri
❌ cursoragent
You have signed the CLA already but the status is still pending? Let us recheck it.

@mateo-berri mateo-berri left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codspeed-hq

codspeed-hq Bot commented May 17, 2026

Copy link
Copy Markdown
Contributor

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing litellm_internal_staging (cf9b5e4) with main (e58a561)

Open in CodSpeed

@yuneng-berri yuneng-berri merged commit a72414a into main May 17, 2026
110 of 131 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.