[Infra] Promote internal staging to main by yuneng-berri · Pull Request #28100 · BerriAI/litellm

yuneng-berri · 2026-05-17T01:32:15Z

Relevant issues

Linear ticket

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

I have added testing in the tests/test_litellm/ directory. Adding at least 1 test is a hard requirement - see details
My PR passes all unit tests on make test-unit
My PR's scope is as isolated as possible, it only solves 1 specific problem
I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

CI status guideline:

50-55 passing tests: main is stable with minor issues.

45-49 passing tests: acceptable but needs attention

<= 40 passing tests: unstable; be careful with your merges and assess the risk.

Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:

Screenshots / Proof of Fix

Type

🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test

Changes

…eaks

Convert the per-test VCR verdict line from a single 'NOOP / HIT / MISS / PARTIAL' tag into a classified outcome that distinguishes the cases that silently bill the live API on every CI run from the ones that don't:

    HIT                   pure replay
    PARTIAL               mixed replay + new recordings
    MISS:RECORDED         new cassette saved to Redis (cached next run)
    MISS:OVERFLOW         cassette > MAX_EPISODES_PER_CASSETTE; persister refused to save; re-bills every run
    MISS:NOT_PERSISTED    test failed; save_cassette skipped; re-bills
    NOOP                  VCR-marked but no HTTP traffic (mocked elsewhere)
    UNMARKED:LIVE_CALL    test bypassed VCR AND opened a TCP connection to a known LLM provider host -> wasted spend
    UNMARKED:NO_TRAFFIC   test bypassed VCR but didn't call out

The UNMARKED:LIVE_CALL signal is what converts 'this test probably hits live' into 'this test connected to api.openai.com'. We install a socket.connect / socket.create_connection wrapper for the duration of each non-VCR-marked test and record any outbound TCP to a known LLM provider hostname. The probe sits below the httpx layer so vcrpy and respx (which both patch above the socket) are unaffected.

Replace the file-level _RESPX_CONFLICTING_FILES blacklists in the llm_translation and local_testing conftests with per-item respx detection in apply_vcr_auto_marker_to_items. A test now skips VCR when it actually carries @pytest.mark.respx or has respx_mock in its fixture chain - not just because some other test in the same file imports MockRouter. Items skipped by skip_files are split into respx_conflict (real conflict, the module wires up respx) vs file_opt_out (dead skip-list entry whose module never touches respx) so the session summary makes pruning obvious.

Stabilize the AWS SigV4 fingerprint: the Authorization header on Bedrock requests rotates its Credential date and Signature on every call, which previously pushed every Bedrock test past the 50-episode overflow threshold. Extract the access-key id only ('aws-sigv4:AKIA...') so two requests with the same identity match.

Always emit verdict logging when VCR is active (set LITELLM_VCR_VERBOSE=0 to opt back into the legacy quiet mode). Add a session-end classification summary that lists overflow tests, unmarked live-call tests, and the skip-reason breakdown.

Wire the live-call probe + summary hook into every test directory that already uses the Redis-backed VCR cache (audio_tests, guardrails_tests, image_gen_tests, litellm_utils_tests, llm_responses_api_testing, llm_translation, local_testing, logging_callback_tests, ocr_tests, pass_through_unit_tests, router_unit_tests, search_tests, unified_google_tests).

Add tests/llm_translation/test_vcr_classification.py covering the verdict classifier, skip-reason tagging, AWS SigV4 fingerprint stability, live-host classification, and session summary rendering.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
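The SigV4 fingerprint stabilization described above can be sketched roughly as follows. This is an illustration only: the function name `fingerprint_authorization` and the regular expression are assumptions, not the code in this PR. The only detail taken from the commit message is the output shape 'aws-sigv4:<access-key-id>'.

```python
import hashlib
import re

# Hypothetical pattern: the access-key id is the first '/'-separated
# component of the SigV4 Credential scope.
_SIGV4_CREDENTIAL_RE = re.compile(r"Credential=([^/,\s]+)/")


def fingerprint_authorization(header_value: str) -> str:
    """Reduce an Authorization header to a value that is stable across calls."""
    if header_value.startswith("AWS4-HMAC-SHA256"):
        match = _SIGV4_CREDENTIAL_RE.search(header_value)
        if match:
            # Keep only the access-key id; the date scope and the
            # Signature change on every request and are dropped.
            return f"aws-sigv4:{match.group(1)}"
        return "aws-sigv4:unknown"
    # Assumption for this sketch: other schemes are hashed so the raw
    # secret never becomes part of a cassette key.
    digest = hashlib.sha256(header_value.encode("utf-8")).hexdigest()
    return f"sha256:{digest[:16]}"


day_one = (
    "AWS4-HMAC-SHA256 Credential=AKIAEXAMPLE/20260516/us-east-1/bedrock/aws4_request, "
    "SignedHeaders=host;x-amz-date, Signature=aaaa"
)
day_two = (
    "AWS4-HMAC-SHA256 Credential=AKIAEXAMPLE/20260517/us-east-1/bedrock/aws4_request, "
    "SignedHeaders=host;x-amz-date, Signature=bbbb"
)
```

Both example headers reduce to the same fingerprint even though the date and signature differ, which is the property that lets a recorded episode match on replay.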

These seven test files were on _RESPX_CONFLICTING_FILES, which made the auto-marker skip them entirely. Inspecting the source shows the only respx artifact is a top-level 'from respx import MockRouter' that no test ever uses - no @pytest.mark.respx, no respx_mock fixture, no respx.mock context manager. The import is dead code left over from a previous mocking pattern. Now that apply_vcr_auto_marker_to_items detects respx per-item via the marker / fixture chain (b637d9f), the file-level skip is no longer needed for these files - they were the reason the OpenAI tests (test_o3_reasoning_effort, test_streaming_response[o1/o3-mini], TestOpenAIO1::test_streaming, TestOpenAIChatCompletion::test_web_search, TestOpenAIO3::test_web_search, etc.) ran live every CI build despite the cassette cache being healthy. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

…en module-level file handles Module-level TEST_IMAGES = [ open(os.path.join(pwd, 'ishaan_github.png'), 'rb'), open(os.path.join(pwd, 'litellm_site.png'), 'rb'), ] SINGLE_TEST_IMAGE = open(...) opens the file once at import. After the first multipart upload, the file pointer is at EOF, so every subsequent test in the same xdist worker sends an empty multipart body. That non-determinism (a) blows the recorded cassette past MAX_EPISODES_PER_CASSETTE (50) so _RedisPersister.save_cassette refuses to save it, and (b) re-bills the live image edit endpoint on every CI run. Recent CI runs confirm the leak: tests/image_gen_tests/test_image_edits.py shows six tests parking at 51-52 cassette entries (TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False], TestOpenAIImageEditDallE2::..., test_openai_image_edit_with_bytesio, test_openai_image_edit_litellm_router, test_multiple_vs_single_image_edit[False], test_multiple_image_edit_with_different_formats). Replace the module-level file handles with _make_test_images() / _make_single_test_image() factories that return fresh _RewindableImage (BytesIO subclass) objects whose pointer always starts at 0. The image bytes are read once at import into module-level constants (_ISHAAN_GITHUB_BYTES, _LITELLM_SITE_BYTES), so disk I/O cost is unchanged. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

The suffix '.bedrock-runtime.amazonaws.com' never matched real Bedrock endpoints, which use the format 'bedrock-runtime[-fips].{region}.amazonaws.com' (region between 'bedrock-runtime' and 'amazonaws.com'). Add an explicit host check for that pattern so Bedrock live calls are visible to the probe, and update the unit test accordingly. Also drop the unused '_LIVE_CALL_PROBE_INSTALLED' module variable.
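A minimal sketch of the host check, assuming a regular expression is an acceptable way to express the 'bedrock-runtime[-fips].{region}.amazonaws.com' pattern. The helper name `is_bedrock_runtime_host` is hypothetical and the actual probe may match hosts differently.

```python
import re

# Region sits between 'bedrock-runtime[-fips]' and 'amazonaws.com'.
_BEDROCK_HOST_RE = re.compile(
    r"^bedrock-runtime(-fips)?\.[a-z0-9-]+\.amazonaws\.com$"
)

# The suffix the commit message says never matched real endpoints.
_OLD_SUFFIX = ".bedrock-runtime.amazonaws.com"


def is_bedrock_runtime_host(host: str) -> bool:
    """Return True for Bedrock runtime endpoints, including FIPS variants."""
    return bool(_BEDROCK_HOST_RE.match(host.lower().rstrip(".")))


real_endpoint = "bedrock-runtime.us-east-1.amazonaws.com"
old_check_matches = real_endpoint.endswith(_OLD_SUFFIX)
new_check_matches = is_bedrock_runtime_host(real_endpoint)
```

The old suffix check returns False for the real endpoint because the region label splits 'bedrock-runtime' from 'amazonaws.com'.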

… upload The _RewindableImage(BytesIO) wrapper auto-rewound on every read after EOF, which made the OpenAI SDK's multipart upload writer read the same bytes forever instead of seeing EOF. Workers OOM'd / SIGKILL'd: [gw0] node down: Not properly terminated replacing crashed worker gw0 ... worker 'gw1' crashed while running 'tests/image_gen_tests/test_image_edits.py::TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False]' The auto-rewind was added defensively for parametrized + flaky-retried tests, but BaseLLMImageEditTest::test_openai_image_edit_litellm_sdk already calls get_base_image_edit_call_args() once per invocation and that helper now constructs fresh streams via _make_test_images(), so rewinding inside the stream is unnecessary. Replace with plain BytesIO seeded with the cached image bytes. Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

The pass_through prompt-caching tests (test_prompt_caching_returns_cache_read_tokens_on_second_call, test_prompt_caching_streaming_second_call_returns_cache_read) make a warm-up call and then assert the *second* call sees a non-zero cache_read_input_tokens count from the upstream's prompt-cache. VCR replay can't model cross-call provider state — both calls match the same cassette episode, so the second call returns the first call's pre-warmup response and the assertion fails: AssertionError: Expected cache_read_input_tokens > 0 on second call, but got 0. Full usage: {'input_tokens': 4986, 'cache_creation_input_tokens': 4974, 'cache_read_input_tokens': 0} This started biting after the AWS SigV4 fingerprint stabilization (b637d9f): Bedrock requests now produce a stable per-access-key fingerprint instead of a per-request signature, so cassettes successfully replay where they previously always missed and re-recorded live. Opt these tests out via skip_nodeid_suffixes so they run live and match the existing pattern in tests/llm_translation/conftest.py (::test_prompt_caching). Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
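The replay limitation can be illustrated with a toy model. Nothing below is vcrpy: `ToyProvider` and `replay_or_record` are hypothetical stand-ins that only mimic the behavior the commit message describes, where two identical requests resolve to the same recorded episode.

```python
from typing import Callable, Dict


class ToyProvider:
    """Stand-in upstream whose prompt cache warms after the first call."""

    def __init__(self) -> None:
        self._warm = False

    def call(self) -> Dict[str, int]:
        usage = {"cache_read_input_tokens": 4974 if self._warm else 0}
        self._warm = True
        return usage


def replay_or_record(
    cassette: Dict[str, Dict[str, int]],
    request_key: str,
    live_call: Callable[[], Dict[str, int]],
) -> Dict[str, int]:
    """Record the first response for a key, then replay it for every match."""
    if request_key not in cassette:
        cassette[request_key] = live_call()
    return cassette[request_key]


# Live: the second call observes the warmed cache.
live = ToyProvider()
live_first = live.call()
live_second = live.call()

# Replay: both calls share a key, so the second returns the first response.
cassette: Dict[str, Dict[str, int]] = {}
upstream = ToyProvider()
replayed_first = replay_or_record(cassette, "POST /invoke <body-hash>", upstream.call)
replayed_second = replay_or_record(cassette, "POST /invoke <body-hash>", upstream.call)
```

In this model the replayed second call reports zero cache-read tokens, which mirrors the assertion failure quoted above and is why these tests are opted out and run live.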

… to AST

Address two greptile P2 review concerns on PR #27795:

1. MISS:OVERFLOW was firing whenever total > MAX_EPISODES_PER_CASSETTE regardless of cassette state. A cassette that grew past the cap historically but this run only *replayed* (dirty=False) is healthy - the persister never tries to save, so the cache state is stable and the next run will replay too. Only flag OVERFLOW when dirty=True (new episodes were recorded that the persister would refuse to save). Add a regression test covering the dirty=False + large-total case.

2. _module_uses_respx did substring matching on the module source, which false-positives on comments / docstrings / string literals. A comment like

       # Previously tried respx.mock but switched to vcrpy

   would keep a file pinned on the opt-out list, defeating the dead-import pruning goal of this PR. Replace the substring scan with an ast.NodeVisitor (_RespxUsageVisitor) that only counts:

   - @pytest.mark.respx / @respx.mock decorators
   - with respx.mock(): ... (sync + async) context managers
   - respx.mock(...) calls outside a with/decorator
   - function parameters / fixture names equal to respx_mock

   Add tests for the comment / docstring / string-literal cases plus each real-usage pattern.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
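A rough sketch of the AST-based detection, assuming the four usage patterns listed in the commit message. The class and function names here are illustrative and not the PR's `_RespxUsageVisitor`. One known limitation of this sketch: an aliased import such as `from respx import mock` followed by `@mock` would not be detected.

```python
import ast


class RespxUsageVisitor(ast.NodeVisitor):
    """Flag real respx usage; comments and string literals never reach the AST
    as Attribute or arg nodes, so they cannot trigger a match."""

    def __init__(self) -> None:
        self.uses_respx = False

    @staticmethod
    def _dotted(node: ast.AST) -> str:
        """Render a Name/Attribute chain such as pytest.mark.respx."""
        parts = []
        while isinstance(node, ast.Attribute):
            parts.append(node.attr)
            node = node.value
        if isinstance(node, ast.Name):
            parts.append(node.id)
            return ".".join(reversed(parts))
        return ""

    def visit_Attribute(self, node: ast.Attribute) -> None:
        # Covers decorators, `with respx.mock():` and bare respx.mock(...) calls,
        # because each of them contains this attribute chain.
        if self._dotted(node) in {"pytest.mark.respx", "respx.mock"}:
            self.uses_respx = True
        self.generic_visit(node)

    def visit_arg(self, node: ast.arg) -> None:
        # A parameter named respx_mock requests the respx fixture.
        if node.arg == "respx_mock":
            self.uses_respx = True
        self.generic_visit(node)


def module_uses_respx(source: str) -> bool:
    visitor = RespxUsageVisitor()
    visitor.visit(ast.parse(source))
    return visitor.uses_respx


DEAD_IMPORT = '''
from respx import MockRouter
# Previously tried respx.mock but switched to vcrpy
def test_a():
    "docstring mentioning respx.mock"
'''
DECORATED = "import respx\n@respx.mock\ndef test_b():\n    pass\n"
FIXTURE = "def test_c(respx_mock):\n    pass\n"
ASYNC_WITH = "import respx\nasync def test_d():\n    async with respx.mock():\n        pass\n"
```

A substring scan would flag `DEAD_IMPORT` because of the comment and docstring, while the visitor ignores both and still catches the three real-usage modules.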

…mary actually renders under xdist `_session_stats` is a module-level dict mutated inside `_vcr_outcome_gate` — which runs in each xdist worker process. The controller's `pytest_terminal_summary` then reads its own empty `_session_stats` and bails on `if not counts: return`, so the OVERFLOW / LIVE_CALL sections the rest of this PR adds never make it into CI logs in the dist mode CI actually uses. Ship a structured `vcr_outcome` payload via `user_properties` (which xdist round-trips) and add `aggregate_report_outcome` on the controller to fold worker outcomes into `_session_stats`. The recording process tags `vcr_recorded_by` with `PYTEST_XDIST_WORKER` so the controller can tell "single-process — already counted locally" apart from "produced by a worker — needs aggregation here", and not double-count when there's no xdist. Covered by 9 new unit tests in test_vcr_classification.py including the end-to-end summary render path.
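The controller-side aggregation might look roughly like the sketch below. It assumes the payload is a dict with a "verdict" key and that `vcr_recorded_by` is empty when no xdist worker produced the report. Both are guesses about the shape, and the function signature is not the PR's `aggregate_report_outcome`. The `(name, value)` tuple layout of `user_properties` is standard pytest behavior.

```python
from collections import Counter
from typing import Any, Iterable, Tuple


def aggregate_report_outcome(
    user_properties: Iterable[Tuple[str, Any]],
    stats: Counter,
) -> None:
    """Fold one test report's VCR outcome into the controller's counters."""
    props = dict(user_properties)
    outcome = props.get("vcr_outcome")
    if not outcome:
        return  # test carried no VCR payload
    if not props.get("vcr_recorded_by"):
        # Single-process run: the outcome was already counted locally,
        # so counting it here would double-count.
        return
    stats[outcome["verdict"]] += 1


stats: Counter = Counter()

# Report that arrived from xdist worker gw0.
aggregate_report_outcome(
    [("vcr_outcome", {"verdict": "MISS:OVERFLOW"}), ("vcr_recorded_by", "gw0")],
    stats,
)
# Report produced in-process (no worker id attached).
aggregate_report_outcome(
    [("vcr_outcome", {"verdict": "HIT"}), ("vcr_recorded_by", None)],
    stats,
)
```

Only the worker-produced report is added to the counters, which matches the double-counting concern described above.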

…itellm_vcr-cache-observability-and-fixes-c5bc

* fix: block NaN/Inf budget bypass and add missing non-admin guards

Addresses three security issues:

GHSA-wvg4-6222-3q4r: /user/update exposes max_budget, soft_budget, spend to self-editing non-admin users with no server-side guard. Non-admin callers now receive HTTP 403 if any of those fields appear in the update payload.

GHSA-q775-qw9r-2r4g: _enforce_upperbound_key_params returned early (no-op) when upperbound_key_generate_params was absent from config, letting any authenticated user generate a key with unlimited max_budget. Fix adds a delegated-authority ceiling in _common_key_generation_helper: non-admins cannot grant a key more budget than their own key carries.

GHSA-2rv4-xv66-fpjg: float('nan') passes every `value < 0` guard because nan < 0 is False in Python, and spend >= nan is always False, permanently disabling budget enforcement for any entity carrying a NaN max_budget. All write-time budget guards now use `not math.isfinite(v) or v < 0`. _enforce_upperbound_key_params validates finiteness unconditionally (before the early-return). All spend-enforcement comparisons in auth_checks.py are now guarded with math.isfinite(max_budget) as defense-in-depth.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: close budget ceiling bypass for callers with no max_budget (GHSA-q775)

Non-admin callers whose API key has no explicit max_budget (None) could bypass the delegated-authority ceiling and create keys with arbitrary budgets. Now blocks budget assignment when caller has no budget configured. Also removes redundant inline import of LitellmUserRoles.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: only apply budget ceiling to explicitly requested max_budget

Capture the caller-supplied max_budget before _enforce_upperbound_key_params can fill it with a default, so auto-filled defaults don't trigger the ceiling guard for non-admin users with no budget on their own key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: capture requested max_budget before any defaults are applied

Move _requested_max_budget capture before both default_key_generate_params and upperbound_key_generate_params mutations, so auto-filled values don't trigger the ceiling check for non-admin users.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: allow unlimited-budget callers to delegate any budget

Callers with max_budget=None (unlimited) can legitimately create budget-capped keys. Only block when caller has an explicit budget and the requested budget exceeds it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
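The NaN bypass and the write-time guard quoted above can be demonstrated in a few lines. The helper name `is_invalid_budget` is hypothetical. The guard expression itself, `not math.isfinite(v) or v < 0`, is taken from the commit message.

```python
import math
from typing import Optional

nan = float("nan")

# Every ordered comparison with NaN is False, so both checks are defeated.
naive_guard_rejects_nan = nan < 0        # a `value < 0` guard lets NaN through
spend_exceeds_nan_budget = 100.0 >= nan  # enforcement never triggers


def is_invalid_budget(value: Optional[float]) -> bool:
    """Write-time guard: reject NaN, +/-Inf and negative budgets.

    None is treated as 'no budget set' in this sketch and is not rejected.
    """
    if value is None:
        return False
    return not math.isfinite(value) or value < 0
```

Checking finiteness first matters because a NaN budget would otherwise be stored and silently disable every later `spend >= max_budget` comparison.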

* feat(lasso): extend LassoGuardrail to support tool calling (RND-5748) * fix(lasso): PR review followups for tool-calling guardrail (RND-5748) * fix(lasso): handle object-style tool_calls in _update_tool_calls_from_masked (RND-5748) * fix(lasso): use model role for tool_use blocks (RND-5748) * test(lasso): add round-trip tests for message transformation (RND-5748) * fix(lasso): remove unused imports, handle Responses-API input masking, flatten multimodal content (RND-5748) * fix(lasso): inspect Responses-API input field (RND-5748) * fix(lasso): guard text-cursor remap against Lasso count mismatch (RND-5748) * fix(lasso): flatten list content in tool_result.content (RND-5748) * fix(lasso): remap multimodal list content during masking (RND-5748) Bug: _map_masked_messages_back counted list-content messages in original_text_count but the remap loop only handled isinstance(str). The positional text_cursor never advanced for list messages, causing all subsequent masked texts to be written onto the wrong messages. Fix: added elif isinstance(content, list) branch that replaces the list with the masked text string and advances the cursor — mirrors the existing string-content branch. Also handles the assistant + tool_calls combo for list-content messages. Test: test_map_masked_messages_back_list_content verifies a user message with [text + image_url] followed by an assistant message gets correct masked content on both (cursor stays aligned). * refactor(lasso): extract _get_field and _extract_tool_call_fields helpers (RND-5748) The dict-vs-object access pattern (x.get('y') if isinstance(x, dict) else getattr(x, 'y', None)) was duplicated 14 times across 5 methods. _get_field(obj, field) — single-point dict/Pydantic field access. _extract_tool_call_fields(call) — returns (call_id, name, parsed_input) with JSON argument parsing, replacing ~30 duplicate lines in both async_post_call_success_hook and _expand_messages_for_classification. 
Also simplified _update_tool_calls_from_masked, _prepare_payload tool mapping, and _apply_masking_to_model_response call_id extraction. Net ~60 lines removed. No behavior change — all 32 tests pass. * fix(lasso): add count guard to _apply_masking_to_model_response (RND-5748) _apply_masking_to_model_response used a bare text_cursor without verifying 1:1 correspondence between text-bearing choices and masked text entries. If Lasso returned a different number of text messages than choices with content, masked text would be applied to the wrong choice or silently skip choices. Added the same count-mismatch guard pattern already used in _map_masked_messages_back: count original text-bearing choices, compare to masked_text length, skip text remap on mismatch with a warning log. Tool_call masking via id-based lookup is unaffected. Tests: - test_apply_masking_to_model_response_multiple_choices: verifies correct per-choice masked text with 2 choices - test_apply_masking_to_model_response_count_mismatch: verifies content is left unchanged when counts disagree * fix(lasso): close two guardrail-bypass paths flagged in review (RND-5748) * tool-call args: when function.arguments is malformed JSON or parses to a non-object, preserve the raw string as {"arguments": <raw>} so Lasso still inspects it instead of receiving input=None. Covers both pre-call and post-call extraction (shared helper). Also resolves the CodeQL empty-except warning since the except body now assigns parsed=None. * Responses-API input: when a request carries both "messages" and "input", inspect both. Previously a benign messages array let the guardrail skip data["input"] entirely. The masking write-back is split via a count boundary so masked messages flow back to data["messages"] and masked input flows back to data["input"] without cross-contamination. Tests: malformed/non-object args round-trip, dual-field classification, dual-field masking write-back split. 
* chore(lasso): black formatting + comment on expand skip branch (RND-5748) * black: wrap two long expressions in lasso.py and reformat dict literals in test_lasso.py to satisfy CI lint. * add a short comment in _expand_messages_for_classification explaining why empty string and None content are intentionally skipped (None is the OpenAI shape for a pure tool-call turn). * fix(lasso): satisfy mypy in _handle_masking, _update_tool_calls_from_masked, _apply_masking_to_model_response (RND-5748) * Narrow `response.get("messages")` into a local before slicing so mypy doesn't see `Optional[List[Dict[str, str]]]` as non-indexable. * Rename the two write-side `func` bindings in `_update_tool_calls_from_masked` to `func_dict` / `func_obj` so mypy doesn't unify the dict and Any|None branches. * Rename the inner loop variable in `_apply_masking_to_model_response` from `msg` to `masked_msg` to avoid clashing with the `msg = choice.message` rebinding below. No behavior change; resolves the 7 mypy errors from the CI lint job.
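The two helpers described in the refactor commit could be sketched as below. This follows the behavior stated in the commit messages (dict-or-object field access, JSON argument parsing, and preserving malformed or non-object arguments as {"arguments": <raw>}), but the bodies are a reconstruction under those assumptions and not the code in lasso.py.

```python
import json
from types import SimpleNamespace
from typing import Any, Optional, Tuple


def get_field(obj: Any, field: str) -> Any:
    """Single-point access for dict-style and object-style (e.g. Pydantic) values."""
    if isinstance(obj, dict):
        return obj.get(field)
    return getattr(obj, field, None)


def extract_tool_call_fields(call: Any) -> Tuple[Optional[str], Optional[str], Optional[dict]]:
    """Return (call_id, name, parsed_input) for one tool call."""
    call_id = get_field(call, "id")
    function = get_field(call, "function")
    name = get_field(function, "name")
    raw_args = get_field(function, "arguments")
    if raw_args is None:
        return call_id, name, None

    parsed: Any = raw_args
    if isinstance(raw_args, str):
        try:
            parsed = json.loads(raw_args)
        except json.JSONDecodeError:
            parsed = None
    if not isinstance(parsed, dict):
        # Malformed JSON or a non-object value: keep the raw string so the
        # guardrail still has something to inspect.
        parsed = {"arguments": raw_args}
    return call_id, name, parsed


dict_call = {"id": "c1", "function": {"name": "lookup", "arguments": '{"q": "x"}'}}
object_call = SimpleNamespace(
    id="c2", function=SimpleNamespace(name="lookup", arguments="not json")
)
list_call = {"id": "c3", "function": {"name": "lookup", "arguments": "[1, 2]"}}
```

With these inputs the dict-style call parses to `{"q": "x"}`, while the malformed and list-valued arguments are both wrapped as `{"arguments": <raw string>}`.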

- Introduce `_CallbackCapabilities` dataclass and `ProxyLogging._callback_capabilities()` static method that inspects `litellm.callbacks` once and caches capability flags keyed on (list length, member ids); invalidates automatically when the callback list mutates, without per-request iteration overhead
- Replace O(n) `litellm.callbacks` walks in `async_pre_call_hook`, `during_call_hook`, `async_post_call_streaming_iterator_hook`, `async_post_call_streaming_hook`, and `post_call_response_headers_hook` with fast-path exits when no relevant callbacks are registered
- Add `needs_iterator_wrap()` and `needs_per_chunk_streaming_hook()` instance methods to decouple iterator-level wrapping from per-chunk hook execution; avoids `get_response_string` materialization per chunk when no guardrail or chunk-hook callback is active
- Introduce `_fast_serialize_simple_model_response_stream()` using `orjson` for common single-choice text streaming chunks, bypassing the full Pydantic serializer; falls back to `model_dump_json` for tool calls, logprobs, usage, and provider-specific fields
- Add early-return in `_restamp_streaming_chunk_model` when downstream model already matches the requested model, avoiding unnecessary string comparisons on every chunk
- Fix stale zero-cost cache bug in `_is_model_cost_zero`: move the per-router `_zero_cost_cache` dict onto the `Router` instance and clear it in `_invalidate_model_group_info_cache` so in-place pricing updates via `upsert_deployment` immediately resume budget enforcement
- Add `scripts/benchmark_chat_completions_perf.py`: standalone async benchmarking tool with a mock OpenAI provider, LiteLLM proxy process management, non-streaming RPS, streaming TTFT, and full-stream latency measurements with repeat/median run support
- Add comprehensive unit tests covering capability detection, cache invalidation, fast-path correctness, zero-cost cache regression, and the no-callback streaming fast path

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
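The capability cache in the first bullet might be structured along these lines. The flag names, the `hasattr`-based detection and the `CapabilityCache` class are all assumptions for illustration. Only the cache key of (list length, member ids) comes from the commit message.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass(frozen=True)
class CallbackCapabilities:
    has_pre_call_hook: bool
    has_streaming_hook: bool


class CapabilityCache:
    """Recompute capability flags only when the callback list changes."""

    def __init__(self) -> None:
        self._key: Optional[Tuple[int, Tuple[int, ...]]] = None
        self._value: Optional[CallbackCapabilities] = None
        self.recomputes = 0  # exposed so the caching behavior is observable

    def get(self, callbacks: List[Any]) -> CallbackCapabilities:
        key = (len(callbacks), tuple(id(cb) for cb in callbacks))
        if key != self._key or self._value is None:
            # Hypothetical detection: a callback "has" a hook if it defines it.
            self._value = CallbackCapabilities(
                has_pre_call_hook=any(
                    hasattr(cb, "async_pre_call_hook") for cb in callbacks
                ),
                has_streaming_hook=any(
                    hasattr(cb, "async_post_call_streaming_hook") for cb in callbacks
                ),
            )
            self._key = key
            self.recomputes += 1
        return self._value


class GuardrailCallback:
    """Example callback that only implements the pre-call hook."""

    async def async_pre_call_hook(self, data: dict) -> dict:
        return data


cache = CapabilityCache()
callbacks: List[Any] = []
empty_caps = cache.get(callbacks)
cache.get(callbacks)                     # same key, served from the cache
callbacks.append(GuardrailCallback())    # mutation changes the key
updated_caps = cache.get(callbacks)
```

One caveat with keying on `id()`: if a callback is garbage-collected and a new object is allocated at the same address in the same list position, the key would not change, so a production version may need an additional invalidation signal.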

) The mutation-test workflow timed out at the 350-minute job cap when running whole-folder mutation against litellm/proxy/management_endpoints/ (~30 files, ~1.5 MB of source). Every mutant was running the full test suite, and mutants were generated for lines no test covers — which would survive regardless, just wasting compute. mutmut 3.x's mutate_only_covered_lines setting runs the suite once up front to compute coverage, then skips mutating uncovered lines. This cuts the mutant count dramatically and is the right semantic for the score (no test → no kill possible → uncountable). Per-mutant test filtering by function name is already automatic in mutmut 3.x; no external coverage step is needed.

…der body (#27913)

* fix(rate-limit): stop v3 limiter from leaking internal stash to provider body

PR #27001 (atomic TPM rate limit) introduced a reservation flow that writes five LiteLLM-internal keys onto the request data dict:

    _litellm_rate_limit_descriptors
    _litellm_tpm_reserved_tokens
    _litellm_tpm_reserved_model
    _litellm_tpm_reserved_scopes
    _litellm_tpm_reservation_released

These keys are forwarded as request body params to the upstream provider, which rejects them as unknown fields:

    OpenAI    -> 400 'Unknown parameter: _litellm_rate_limit_descriptors' (mapped by litellm to RateLimitError / 429, hiding the bug behind a misleading 'throttling_error' code)
    Anthropic -> 400 '_litellm_rate_limit_descriptors: Extra inputs are not permitted'

Net effect: every chat completion against any real provider fails the moment a virtual key has any tpm_limit / rpm_limit set - i.e. v3-enforced key-level TPM/RPM limits are broken end-to-end. The v3 RPM/TPM check itself still runs (raises 429 on over-limit), but the success path poisons the upstream body.

Reproduced on litellm_internal_staging HEAD (410ce76) against gpt-4o-mini and claude-haiku-4-5 with a 1-RPM/1-TPM key - first request fails with the provider's unknown-field error.

Fix: the stash is metadata only.

- Add RATE_LIMIT_DESCRIPTORS_KEY constant and a _LITELLM_STASH_KEYS registry so we have a single source of truth for stash keys.
- New helper _stash_value_in_metadata_channels writes to data['metadata'] / data['litellm_metadata'] without touching the top level.
- _stash_reservation_in_data and the descriptor stash now route through that helper. _mark_reservation_released stops writing top-level.
- _lookup_stashed_value also checks kwargs['metadata'] / kwargs['litellm_metadata'] (raw request_data shape) in addition to kwargs['litellm_params']['metadata'] (completion kwargs shape).
- async_post_call_failure_hook now reads descriptors via the unified metadata lookup instead of request_data.get(top-level).
- Defense in depth: async_pre_call_hook strips any stash key that somehow surfaced at the top level (stale cache, future refactor, test fixture) before returning.

Tests:

- New regression test asserts no _litellm_* stash key is present at the top level of data after async_pre_call_hook, and that the metadata channel still carries the reservation + descriptors so success / failure reconciliation works.
- Existing test_tpm_concurrent.py tests that asserted top-level presence are updated to read from data['metadata'] - the location is an implementation detail; the spec is that post-call callbacks can resolve the stash.

Verified end-to-end against OpenAI gpt-4o-mini and Anthropic claude-haiku-4-5 via /v1/chat/completions on a low-rpm key:

- With limits not exceeded: HTTP 200, valid completion response, no leaked fields in body.
- With RPM exceeded: HTTP 429 from v3 enforcement ('Rate limit exceeded ... Limit type: requests').
- With TPM exceeded: HTTP 429 from v3 enforcement ('Rate limit exceeded ... Limit type: tokens').

Full v3 hook test suite passes (171 tests).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* chore(rate-limit): use RATE_LIMIT_DESCRIPTORS_KEY constant in test, trim noisy comments

Address greptile P2: test fixture now uses the imported constant. Drop comments that re-explain what well-named identifiers already convey.

* fix(rate-limit): reject caller-supplied stash values to prevent TPM-refund abuse

Strip _LITELLM_STASH_KEYS from data top-level and both metadata channels at the start of async_pre_call_hook. Without this, an authenticated caller can inject _litellm_rate_limit_descriptors plus _litellm_tpm_reserved_tokens in body metadata, trigger a proxy-side rejection, and cause async_post_call_failure_hook to refund TPM counters against attacker-named scopes (e.g. another tenant's api_key).

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
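The strip-then-stash flow from these commits can be sketched as follows. The key names are the ones listed in the commit message. The function bodies, and in particular the choice to create `data["metadata"]` when neither channel exists, are assumptions made for this sketch.

```python
from typing import Any, Dict

LITELLM_STASH_KEYS = frozenset(
    {
        "_litellm_rate_limit_descriptors",
        "_litellm_tpm_reserved_tokens",
        "_litellm_tpm_reserved_model",
        "_litellm_tpm_reserved_scopes",
        "_litellm_tpm_reservation_released",
    }
)
_METADATA_CHANNELS = ("metadata", "litellm_metadata")


def strip_stash_keys(data: Dict[str, Any]) -> None:
    """Drop caller-supplied or leaked stash keys from every location."""
    for key in LITELLM_STASH_KEYS:
        data.pop(key, None)
    for channel in _METADATA_CHANNELS:
        meta = data.get(channel)
        if isinstance(meta, dict):
            for key in LITELLM_STASH_KEYS:
                meta.pop(key, None)


def stash_value_in_metadata_channels(data: Dict[str, Any], key: str, value: Any) -> None:
    """Write a stash value into the metadata channels, never the top level."""
    channels = [c for c in _METADATA_CHANNELS if isinstance(data.get(c), dict)]
    if not channels:
        # Assumption: fall back to creating data["metadata"].
        data["metadata"] = {}
        channels = ["metadata"]
    for channel in channels:
        data[channel][key] = value


request = {
    "model": "gpt-4o-mini",
    "_litellm_tpm_reserved_tokens": 999_999,           # injected at top level
    "metadata": {
        "_litellm_rate_limit_descriptors": ["victim"],  # injected in metadata
        "user": "alice",
    },
}
strip_stash_keys(request)
stash_value_in_metadata_channels(request, "_litellm_tpm_reserved_tokens", 42)
```

Stripping before stashing means the only stash values present after the pre-call hook are the ones the proxy wrote itself, and none of them sit at the top level where they would be forwarded to the provider.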

* fix: allow for allowlisted redirect URIs * github comment addressing * Update litellm/proxy/_experimental/mcp_server/oauth_utils.py Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com> * harden oauth wildcard further * test: cover wildcard entry with dot-leading suffix rejection --------- Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com>

…de Desktop / Cowork citations) (#27886) * feat(custom_logger): add async_post_agentic_loop_response_hook Lets a CustomLogger shape the response returned by the agentic-loop follow-up call without bypassing the loop's safety / observability machinery (depth tracking, fingerprinting, etc.). Default returns the response unchanged. Used by websearch_interception to inject Anthropic-native web_search_tool_result blocks when the originating client requested a native web_search_* tool. * feat(llm_http_handler): call post-agentic-loop hook on the originating callback In _execute_anthropic_agentic_plan, after anthropic_messages.acreate returns, call the originating callback's async_post_agentic_loop_response_hook so it can mutate the final response (e.g. inject native tool_result blocks). Pass the callback through from _call_agentic_completion_hooks. Exceptions in the post-hook are caught and logged so a buggy callback can't kill the request. * feat(websearch_interception): add is_anthropic_native_web_search_tool Identifies tools the Anthropic-native clients (Claude Desktop, the Anthropic SDK, the Anthropic Console) use to request native search: type starts with "web_search_" (e.g. web_search_20250305). Rejects the LiteLLM standard tool, the OpenAI-function variant, the bare "WebSearch" legacy name, and the bare "web_search" Claude Code shape. This lets us decide per-request whether the client expects web_search_tool_result content blocks in the response, without renaming any existing constants or touching native-provider skip logic. * feat(websearch_interception): add build_web_search_tool_result_block Produces the Anthropic-native web_search_tool_result content block from a structured SearchResponse. Anthropic-native clients use this block to populate citations / source links — the existing text-blob flatten path only feeds readable evidence to the model and discards the structure, so this builder gives us the missing piece. 
Shape matches https://docs.anthropic.com/en/api/web-search-tool — web_search_result items carry url, title, page_age, encrypted_content (empty string when the search provider doesn't supply one). * feat(websearch_interception): emit native web_search_tool_result blocks When the originating client request carried a native Anthropic web_search_* tool, the final response now also carries web_search_tool_result content blocks alongside the model's text answer — so Claude Desktop / Anthropic SDK clients can populate the citations panel and replay conversation history with structured search evidence. Wiring: - Pre-request hooks (both deployment + Anthropic path) set a flag on kwargs when they see a native web_search_* tool, so the signal survives the conversion-to-litellm_web_search step regardless of which hook fires first. - _execute_search now returns (text, SearchResponse) so the structured results aren't lost when the text is flattened for the follow-up model call. - _build_anthropic_request_patch returns the parallel list of SearchResponse objects. - async_build_agentic_loop_plan pre-builds the web_search_tool_result blocks (one per tool_use_id) and stashes them on plan.metadata when the flag is set. - async_post_agentic_loop_response_hook reads the metadata and prepends the blocks to response.content. - _execute_agentic_loop mirrors the injection for the legacy path so both paths behave identically. Clients that send the LiteLLM standard tool keep the existing text-only behavior — no regression. 
* test(websearch_interception): cover native web_search_tool_result emission 18 tests across: - detector branches (native vs litellm-standard, OpenAI-function shape, Claude Desktop builtin WebSearch, bare web_search, missing type) - block-builder shape (results, none, empty) - pre-request hook flag-setting (native sets, standard does not) - async_build_agentic_loop_plan attaches blocks to plan.metadata when the flag is present, leaves metadata untouched when absent - post-hook injection into dict and object responses - legacy _execute_agentic_loop mirrors the injection so both paths return the same shape * test(websearch_short_circuit): keep _execute_search mocks in sync with new tuple return * test(websearch_thinking_constraint): keep _execute_search mocks in sync with new tuple return * feat(websearch_interception): emit native blocks from try_short_circuit_search The agentic-loop post-hook only fires when the model returns a tool_use block. Cowork / Claude Desktop on Bedrock actually make TWO requests per user turn: the main /v1/messages with their builtin tool, and a separate standalone /v1/messages whose only tool is web_search_20250305. That second request hits try_short_circuit_search — no agentic loop, no post-hook — and was returning text-only, leaving the citations panel empty. When the short-circuit input carries a native web_search_* tool, build a synthetic server_tool_use + web_search_tool_result pair (using the structured SearchResponse already returned by _execute_search) so the client gets the native shape it expects. The legacy text block is preserved so non-native short-circuit callers (Claude Code, github_copilot, etc.) see the same payload as before. Failure path still emits the native block pair (with empty results) plus the text-error block, so the client gets a well-formed response rather than a malformed half-shape. 
* test(websearch_native_blocks): cover short-circuit native-block emission Three new cases on top of the existing 18: - native web_search_20250305 short-circuit → [server_tool_use, web_search_tool_result, text], ids paired, urls/titles carried. - litellm_web_search short-circuit → text-only (no regression). - native short-circuit on search failure → still emits the native block pair (empty results) plus the text-error block, so the client never sees a malformed half-shape. * test(websearch_short_circuit): index assertions by block type, not by position Native short-circuit responses now have [server_tool_use, web_search_tool_result, text] when the input carries web_search_20250305 — find the text block by type rather than relying on content[0]. * fix(websearch_interception): gate legacy WebSearch name on schema absence Clients like Cowork / Claude Desktop ship a client-side tool named "WebSearch" with a full input_schema — they handle it themselves and expect to make a separate native web_search_20250305 sub-request for the actual search. Today is_web_search_tool matches the bare name regardless of other fields, which hijacks the client's tool server-side. The agentic loop fires on the main request, the model never gets to emit the client-side tool_use, and the separate native sub-request (where citation data flows) is never made. Net: citations panel empty. Real Anthropic client tools always carry input_schema (the API rejects them otherwise), so a bare {name: "WebSearch"} with no schema is the only thing that could be a legacy interception marker. Gate the match on schema absence: legacy callers (if any) keep working, real client-side WebSearch tools pass through untouched. * fix(websearch_interception): drop "WebSearch" from response-detection lists Post-conversion the model always sees ``litellm_web_search``, so the "WebSearch" entry in the response-side tool_use detection lists was dead at best. 
If a model ever did return ``tool_use(name="WebSearch")`` it would now (incorrectly) hijack the client's own ``WebSearch`` tool again — same Cowork problem we just fixed on the input side. Drop it. * test(websearch_native_blocks): cover the WebSearch legacy-name schema gate Three new cases: - {name: "WebSearch"} (bare interception marker) → still matched - {name: "WebSearch", input_schema: {...}} (Cowork client tool) → passes through untouched - {name: "WebSearch", description: "..."} (no schema) → still matched on the assumption it's a legacy marker rather than a malformed real client tool. --------- Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com>
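The schema-absence gate described above can be sketched roughly as follows. This is an illustration only, not the actual `is_web_search_tool` code: the helper name `matches_legacy_websearch_marker` is hypothetical, and it models just the legacy `"WebSearch"` name branch, assuming "schema absence" means the `input_schema` key is missing.

```python
from typing import Any, Mapping


def matches_legacy_websearch_marker(tool: Mapping[str, Any]) -> bool:
    """Hypothetical helper: decide whether a "WebSearch" tool is a legacy marker.

    Assumption: real client-side tools always carry an input_schema, so a
    tool named "WebSearch" that has one is left for the client to handle.
    """
    if tool.get("name") != "WebSearch":
        return False
    # Gate on schema absence: only a schema-less entry is treated as a marker.
    return "input_schema" not in tool
```

Under this sketch a bare `{name: "WebSearch"}` and a description-only entry still match, while a Cowork-style tool with a full `input_schema` passes through untouched.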

pytest-cov runs with --cov=litellm, which makes coverage.xml store paths relative to the package root (e.g. `proxy/proxy_server.py` instead of `litellm/proxy/proxy_server.py`). Codecov auto-resolves these only when the basename is unique in the repo. Files like proxy_server.py, router.py, utils.py, main.py, and constants.py — which have duplicates under enterprise/ or other subpackages — get silently dropped during ingest. The `fixes: ["::litellm/"]` rule prepends `litellm/` to every uploaded path so they resolve unambiguously. Confirmed against multiple recent coverage.xml artifacts that no uploader currently emits paths already prefixed with `litellm/`, so the rule is safe to apply universally. This restores Codecov visibility for the highest-fix-rate hotspots: proxy_server.py, router.py, proxy/utils.py, litellm_logging.py, constants.py, key_management_endpoints.py, utils.py, main.py, user_api_key_auth.py, team_endpoints.py, and litellm_pre_call_utils.py.

Audit of .github/workflows/ via gh run history shows the following have either never run or have been dormant for 10+ weeks. CI coverage that still matters is preserved on CircleCI (e.g. llm_translation_testing).
Removed workflows:
- test-litellm.yml: workflow_dispatch only, last run 2026-02-12 (cancelled); CCI local_testing_part1/2 covers the same tests
- llm-translation-testing.yml: last run 2025-07-10; replaced by CCI llm_translation_testing job (run_llm_translation_tests.py kept for the make test-llm-translation target)
- run_observatory_tests.yml: last run 2026-03-03 (cancelled)
- scan_duplicate_issues.yml: last run 2026-03-02 (failure)
- publish_to_pypi.yml: never run
- read_pyproject_version.yml: fires on every push to main but its echoed version output is not consumed by any downstream step
Removed orphan files (no callers in workflows, CCI, or Makefile):
- .github/workflows/README.md: documented only publish_to_pypi.yml
- .github/workflows/update_release.py + results_stats.csv
- .github/actions/helm-oci-chart-releaser/

This reverts commit e25a988. The `fixes: ["::litellm/"]` rule turned out to be applied *after* Codecov's auto-resolution, not before. Files with unique basenames (which were auto-resolving correctly to `litellm/<path>`) got an extra `litellm/` prepended, producing `litellm/litellm/<path>` storage. Files with ambiguous basenames (the actual target of the fix) continued to be dropped because the auto-resolution still failed for them. Net result on the verification run: 1375 files now stored under unresolvable `litellm/litellm/...` paths, and the 11 originally-missing hotspots are still missing. Reverting before piling on further changes.

…y-and-fixes-c5bc test(vcr): classify cache verdicts, surface cost leaks, and fix the two biggest leakers

…act vi.mock Per-file `vi.mock("@tremor/react", ...)` factories fully replace the setup-level mock from `tests/setupTests.ts`, so the global Button/Tooltip overrides are lost in any file that re-mocks `@tremor/react`. Without them, the real Tremor `<Button>` leaks through and its internal `useTooltip(300)` schedules a native 300ms `setTimeout` on pointer events. When the test environment is torn down before the timer fires, the trailing `setState` calls `getCurrentEventPriority`, which reads `window.event` against a destroyed jsdom -> "window is not defined" flake observed on CI. Patches the 7 leaky test files to re-supply `Button` (bare `<button>`) and `Tooltip` (Fragment) overrides matching `setupTests.ts`. Also drops a dead `afterEach` workaround in `user_edit_view.test.tsx` (the fake-timer dance it ran could not drain a real timer scheduled before the swap) and corrects a misleading comment in `MakeMCPPublicForm.test.tsx`.

…decov pytest-cov treats --cov=<module-name> as a Python package and emits XML paths relative to the package root, stripping the litellm/ prefix (`proxy/proxy_server.py` instead of `litellm/proxy/proxy_server.py`). Codecov's auto-prefix heuristic then drops every file whose basename is ambiguous in the repo — `proxy_server.py` (3 copies under enterprise/), `router.py` (2 copies), `utils.py` (20+), `main.py` (20+), `constants.py` (2). The 11 highest-fix-rate hotspots have never appeared in Codecov. Switching to --cov=./litellm treats the argument as a path, which makes coverage.xml emit repo-relative paths (`litellm/proxy/proxy_server.py`). Each path is unambiguous, so Codecov resolves all files correctly. Verified locally: rerunning a single proxy_unit_tests test with --cov=./litellm produced `filename="litellm/proxy/proxy_server.py"`, `filename="litellm/router.py"`, and `filename="litellm/types/router.py"` as distinct entries — exactly the disambiguation Codecov needs. Touches every workflow that uploads coverage: the two reusable GHA workflows (_test-unit-base.yml, _test-unit-services-base.yml), test-mcp.yml, and all 14 invocations in .circleci/config.yml.
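The basename ambiguity that trips the auto-resolution can be checked with a small script. The following is a minimal sketch, not part of the PR: the function name `ambiguous_basenames` and the sample path `docs/example_unique.py` are made up for illustration, and it only assumes that a path is resolvable when its basename is unique in the list.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def ambiguous_basenames(repo_paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group repo-relative paths by basename and keep names with more than one owner."""
    by_name: Dict[str, List[str]] = defaultdict(list)
    for path in repo_paths:
        by_name[PurePosixPath(path).name].append(path)
    return {name: sorted(paths) for name, paths in by_name.items() if len(paths) > 1}


# Illustrative input; the third path is a hypothetical unique file.
sample_paths = [
    "litellm/router.py",
    "litellm/types/router.py",
    "docs/example_unique.py",
]
duplicates = ambiguous_basenames(sample_paths)
```

Here `duplicates` reports `router.py` with both of its owners. A report that emitted only `router.py` could not be matched to a single file, whereas the repo-relative paths can.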

…itellm_/funny-williams-dab711

chore(ci): remove unused GitHub Actions workflows and orphan files

…itellm_/peaceful-jang-c0e43b

Remove available_on_public_internet gating from delegate-auth-to-upstream paths so oauth2 + delegate_auth_to_upstream interactive servers behave the same when marked internal. Keeps M2M exclusion. Updates tests.

test(ui): preserve global Button/Tooltip mocks in per-file @tremor/react vi.mock

Log verbose_logger.warning when loading oauth2 interactive servers with available_on_public_internet=false and delegate_auth_to_upstream=true (config + DB). Dashboard Alert for the same combo. CLAUDE note for operators. Tests for log and M2M skip.

Mock the image fetch instead of downloading a 50MB+ image from upload.wikimedia.org. The runner was intermittently rate-limited (HTTP 429), so the code raised "Unable to fetch image ... Status code: 429" and the size-limit assertions failed even though pytest.raises(litellm.ImageFetchError) still matched. Mirror the established LargeImageClient pattern in tests/test_litellm/litellm_core_utils/test_image_handling.py: stub litellm.module_level_client with a response whose Content-Length exceeds the 50MB limit and bypass SSRF validation, so the size-limit rejection path is exercised deterministically with no external network dependency.

…itellm_/nostalgic-johnson-eeb7c3

The Content-Length header check in _process_image_response rejects the image before the body is streamed, so the mock body never needs to be materialized. Use an empty body instead of b"x" * 100MB (addresses greptile/cursor review feedback).

test(gemini): de-flake test_gemini_image_size_limit_exceeded

… openapi field
Two CI failures, both pre-existing in different ways:
1. reasoning_effort_grid: all 33 bedrock_invoke_messages cells failed with AttributeError("module 'litellm' has no attribute 'messages'"). litellm exposes the async Anthropic Messages entrypoint as litellm.anthropic_messages (via "from .llms.anthropic.experimental_pass_through.messages.handler import *" in litellm/__init__.py), not litellm.messages.acreate. Swap the call.
2. tests/test_litellm/interactions/test_openapi_compliance.py::TestResponseCompliance::test_interaction_response_fields asserts the live Google spec contains "steps". Google's spec has churned through "outputs" -> "steps" -> neither, and presently carries neither. The test broke on main as soon as upstream dropped "steps"; pulling the key off the assert list realigns the test with the live schema. Re-add the per-turn output field once upstream stabilizes on a name.
The openapi-compliance fix doesn't belong to this PR conceptually but is included here per request to unblock CI before the morning.

… not class The anthropic_messages route wraps client-side BadRequestError as AnthropicError (a BaseLLMException subclass) with status_code=400, so "except BadRequestError" missed those cells and they fell through to the generic Exception arm, returning 500 instead of the expected 400. Replace the isinstance-on-BadRequestError check with a tiny classifier that prefers BadRequestError membership, then falls back to the exception's status_code attribute (set by every BaseLLMException subclass), then 500. Apply to both _call_chat and _call_messages for consistency. Fixes the 13 CircleCI llm_translation_testing failures on bedrock_invoke_messages cells where the effort was disabled / invalid / empty / xhigh-on-unsupported / max-on-unsupported.
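The classifier order described above (BadRequestError membership, then the exception's status_code attribute, then 500) might look roughly like this. It is a sketch under stated assumptions: the two exception classes are local stand-ins rather than the litellm types, and the name `classify_status` is hypothetical.

```python
class BadRequestError(Exception):
    """Stand-in for litellm.BadRequestError (assumption for this sketch)."""


class WrappedProviderError(Exception):
    """Stand-in for a BaseLLMException subclass that carries a status_code."""

    def __init__(self, message: str, status_code: int) -> None:
        super().__init__(message)
        self.status_code = status_code


def classify_status(exc: Exception) -> int:
    """Map an exception to the HTTP status the grid cell should report."""
    if isinstance(exc, BadRequestError):
        return 400
    # Fall back to the attribute set by wrapped provider errors.
    status = getattr(exc, "status_code", None)
    if isinstance(status, int):
        return status
    return 500
```

Checking the attribute rather than the class is what lets a wrapped 400 report as 400 even when it is not a BadRequestError instance.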

Four pre-existing flakes on main that gate this branch's workflow even though they're unrelated to the reasoning_effort_grid suite:
1. tests/local_testing/test_completion.py::test_completion_fireworks_ai
2. tests/local_testing/test_completion_cost.py::test_completion_cost_fireworks_ai[fireworks_ai/llama-v3p3-70b-instruct]
3. tests/llm_translation/test_fireworks_ai_translation.py::test_document_inlining_example[False]
The Fireworks-hosted `llama-v3p3-70b-instruct` deployment is currently returning 404 "Model not found, inaccessible, and/or not deployed". These tests pass when the model is deployed; the issue is upstream capacity, not our code path. Wrap the live call in a try/except that pytest.skip's on litellm.NotFoundError so a Fireworks deployment hiccup no longer fails CI for unrelated PRs.
4. tests/llm_translation/test_gemini.py::test_gemini_image_size_limit_exceeded
The test fetches the 32MB "Blue Marble 2002" image from Wikimedia to exercise the 50MB image-size cap. CI runners share an IP pool with noisy traffic, so Wikimedia routinely returns HTTP 429. The size-limit check never gets a chance to fire. Catch the 429 BadRequestError and pytest.skip in that case.
None of these belong on this PR conceptually, but they're included per request to unblock the workflow before morning.

…ageFetchError litellm.ImageFetchError is a subclass of BadRequestError, so when Wikimedia returns 429 the pytest.raises(ImageFetchError) block matches and swallows the exception -- the outer try/except never fires. Drop the try/except and check the captured error message for "Status code: 429" after the raises block, calling pytest.skip in that case. Same intent, right control flow.

…view
Two P2 nits flagged by Greptile on PR 28036:
1. _build_completion_kwargs() defaulted vertex_project to "vertex-check-481318" when VERTEX_PROJECT was unset. That value is a specific GCP project that doesn't belong to this repo, so if the env-var skip guard were ever bypassed (misconfig, direct helper call), the test would silently issue calls to a foreign project rather than failing loudly. Drop the fallback and read os.environ["VERTEX_PROJECT"] directly, mirroring how AZURE_FOUNDRY_* are handled.
2. _build_messages_kwargs() was a one-liner that returned the result of _build_completion_kwargs() unchanged -- a dead abstraction with one caller. Inline at the _call_messages call site and delete the helper.

…s-cZRwz Resolve conflicts in the five unrelated CI-flake fixes I previously landed on this branch -- staging shipped stronger versions (mocked HTTP for the Fireworks tests, mocked image-fetch for the Gemini size-limit test, switched the openapi-compliance test to the Interaction response schema instead of dropping the assertion). Take staging's version of all five files and drop my now-unreachable 429-skip lines from the Gemini test that the auto-merge left behind.

…nd migrations (#27557)
Split the monolithic LiteLLM proxy into independently scalable Kubernetes components to allow separate horizontal scaling of the LLM data plane and management API surfaces
- Add DatabaseURLSettings pydantic-settings model that assembles DATABASE_URL (and optional DATABASE_URL_READ_REPLICA) from discrete DATABASE_* env vars before Prisma initializes, supporting both IAM token auth (minting short-lived RDS tokens) and password auth; replaces the CLI-only path that componentized entrypoints bypass
- Add gateway component (port 4000) that trims the proxy route table to the LLM data-plane surface (chat, embeddings, completions, audio, realtime, provider passthroughs, health/metrics) via an allowlist applied inside the lifespan context so plugin-registered routes are captured
- Add backend component (port 4001) that exposes the management/admin surface (keys, users, teams, orgs, spend analytics, model management, SSO, audit logs) with a complementary allowlist
- Add ui component: Next.js static export served by nginx (port 3000) with RSC payload routing, asset prefix aliasing, and SPA fallback for dashboard routes
- Add migrations component with dedicated Dockerfile that runs prisma migrate deploy via a Helm pre-install/pre-upgrade Job, eliminating per-pod schema contention on the Prisma advisory lock
- Add Helm chart (helm/litellm) with separate Deployments, Services, HPAs, and ConfigMap for each component; shared _helpers.tpl emits DATABASE_*, IAM_TOKEN_DB_AUTH, REDIS_*, and DISABLE_SCHEMA_UPDATE env vars from chart values; ingress template routes traffic to the correct component by path prefix
- Add comprehensive tests for DatabaseURLSettings covering IAM auth, password auth, read replica fallbacks, operator-pinned URL preservation, and percent-encoding; add coverage test asserting gateway + backend allowlist union equals the full proxy route set
- Add pydantic-settings>=2.14.1 as a proxy extra dependency and update liccheck allowlist
Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>

test(ci): add reasoning_effort grid e2e regression suite

…ps (#28028) * fix(ci): flag codecov uploads and enable carryforward Coverage uploads from GHA and CircleCI were unflagged. Commits that receive the push-triggered workflows more than once (re-runs, or branches cut at the same SHA) accumulated many overlapping flagless sessions, and Codecov's per-commit merge dropped the largest, ubiquitously-imported files (router.py, proxy_server.py, main.py, utils.py, cost_calculator.py) from the report even though the uploaded XMLs contained them. - codecov.yaml: flag_management.default_rules.carryforward: true - GHA reusable bases: tag each upload with its workflow/shard name - CircleCI: tag the combined upload "circleci"; also combine the agent / google_generate_content_endpoint / litellm_utils datafiles that were produced and required but missing from the combine list * fix(ci): close coverage gaps in proxy-legacy, router-unit, auth-ui, caching-redis - test-unit-proxy-legacy: route through _test-unit-base so the full proxy_unit_tests suite (incl. comprehensive test_proxy_server*.py) is measured and uploaded with per-group flags (was plain pytest, no --cov) - _test-unit-services-base: declare the enable-redis input + the six secrets test-unit-caching-redis passes; that workflow had a workflow_call signature mismatch and startup_failed on every push (never ran). Changes are additive/optional - proxy-db and security callers unchanged - circleci: add --cov + persist + combine + upload-coverage requires for litellm_router_unit_testing (tests/router_unit_tests) and auth_ui_unit_tests (tests/proxy_admin_ui_tests); neither was covered anywhere. Redundant -k subset jobs left as-is (local_testing covers them) * fix(ci): remove dead GHA Redis workflow; keep Redis on CircleCI only CircleCI redis_caching_unit_tests already runs the exact same files (tests/local_testing/test_dual_cache.py, test_redis_batch_optimizations.py, test_router_utils.py) with --cov, and that datafile is already combined and uploaded. 
The GHA test-unit-caching-redis workflow was redundant and had never run (workflow_call signature mismatch -> startup_failure on every push). - Delete .github/workflows/test-unit-caching-redis.yml - Revert _test-unit-services-base.yml to the flag-fix state (drop the enable-redis input / secrets / env wiring added only to prop up the GHA Redis workflow); the verified per-upload flags line is kept - The only single-star "litellm_*" branch glob lived in the deleted file; no other single-star globs exist, so none remain to widen * fix(ci): keep proxy-legacy as a standalone job to preserve required check names Routing proxy-legacy through the reusable workflow renamed each check from the bare matrix name (e.g. "proxy-response-and-misc") to "proxy-response-and-misc / Run tests". Those bare names are required status checks in branch protection, so the old contexts never reported and PRs sat "Expected — Waiting for status to be reported" indefinitely. Restore the original standalone matrix job (job name == matrix name, so the required contexts report again) and add coverage in place: --cov on pytest plus an OIDC Codecov upload flagged proxy-legacy-<group>. Net effect of the gap-#2 fix is preserved (flagged coverage for tests/proxy_unit_tests/**) without changing any check name. * revert(ci): drop all proxy-legacy changes from this PR tests/proxy_unit_tests/** is already fully covered by test-unit-proxy-db (its shard-coverage guard fails CI if any file in that dir is unassigned), which this PR already flags + carryforwards. Adding --cov and id-token:write to the legacy pull_request job was redundant and put OIDC on a job that runs untrusted PR code. Restore the file to the base version verbatim so this PR no longer touches proxy-legacy at all (also restores its original required check names). Retiring proxy-legacy in favor of proxy-db on pull_request is a separate effort that needs a branch-protection change.

… code, route/path, preprocessing latency) (#28040) * feat(otel): expose http.response.status_code on failure spans Set the OTel-standard http.response.status_code (integer) on failure spans alongside the existing OpenInference error.code (kept for back-compat). error.type is already emitted via ERROR_TYPE. Crucially, also record structured error attributes on the proxy SERVER span ('Received Proxy Server Request') from async_post_call_failure_hook - the only place the SERVER span is in hand. _handle_failure records on the litellm_request child span (the parent span is not propagated into its kwargs), so prior to this change the SERVER span that dashboards query carried only span status, never error.code/error.type. Reuses _record_exception_on_span + StandardLoggingPayloadSetup.get_error_information so values match the child span. Tests: recorder unit coverage + a hook-driven test asserting the SERVER span is stamped (the gap recorder-only tests missed). Full test_opentelemetry.py suite: 197 passed. * feat(otel): set http.route + url.path on the proxy SERVER span Add the OTel-standard http.route (low-cardinality route template, e.g. /v1/threads/{thread_id}/runs) and url.path (literal path) to the SERVER span ('Received Proxy Server Request') so dashboards can group traffic by endpoint instead of seeing every path param as a unique value. Same architectural gap as the status-code commit: the success/failure logging handlers write the litellm_request CHILD span, and _handle_success explicitly refuses to copy to the SERVER span. Verified with a console-exporter run that the SERVER span was bare on success. Unlike error info, route/path are known at request time, so set them directly on the freshly-created SERVER span in user_api_key_auth (one edit point, works for success and failure, no hook-ordering risk): - http.route from the matched FastAPI route (scope['route'].path), empirically confirmed populated at auth-dependency time. 
- url.path from the existing literal-path variable. New get_request_route_template helper + set_proxy_request_route_attributes (no-op on None span, so the Langfuse override stays safe). Tests: route-attribute setter + route-template helper edges. Full test_opentelemetry.py and test_auth_utils.py green. * feat(otel): set litellm.preprocessing.duration_ms on the proxy SERVER span Expose the total time LiteLLM spends before the upstream provider request begins (auth + parsing + pre-call hooks) as a single number on the SERVER span ('Received Proxy Server Request'). Window: proxy-receive -> FIRST provider handoff. Retry semantics: first attempt only (pure preprocessing, excludes retry loops + backoff). api_call_start_time is overwritten on every attempt, so a set-once first_api_call_start_time pins the first handoff. Same architectural gap as the prior two commits: the success/failure logging handlers write the litellm_request CHILD span, not the SERVER span. Set it instead from the post-call hooks on user_api_key_dict.parent_otel_span. Failure-path subtlety: request_data.pop('litellm_logging_obj') runs before the failure-hook loop, so the failure hook can't read the logging object. litellm_received_at is propagated via the existing request->metadata channel, and first_api_call_start_time is mirrored onto litellm_params.metadata, so both anchors survive into request_data and the OTel helper reads them uniformly for success and failure. Edits: user_api_key_auth (stash receive instant), litellm_pre_call_utils (propagate it), litellm_logging (set-once first handoff + metadata mirror), opentelemetry (constant + set_preprocessing_duration_attribute, called from both post-call hooks). Tests: duration helper (both container shapes, missing/negative/None edges) + set-once invariant (retry doesn't overwrite, metadata mirror). test_opentelemetry.py + test_auth_utils.py + test_litellm_logging.py: 447 passed. 
Verified live: SERVER span carries the attribute on success and failure, coexisting with the status-code and route attributes. * fix(otel): MyPy type-narrowing for status-code + preprocessing-duration No behavior change. MyPy (CI lint) flagged: - error_information["error_code"] is str|None: narrow via a None-checked local before int(). - _to_timestamp returns Optional[float]: resolve both anchors and return early if either is None instead of subtracting possibly-None floats. * fix(otel): stop polluting user request metadata with first_api_call_start_time The PR3 set-once preprocessing anchor was mirrored into litellm_params["metadata"] from core litellm_logging.py. That dict is the caller's request metadata, mutated in place and shared across every call path including pure SDK (litellm.acreate_batch). It got echoed into LiteLLMBatch(metadata=...), which the OpenAI batch schema types as Dict[str, str] -> pydantic ValidationError on a datetime value. - litellm_logging.py: set first_api_call_start_time only on model_call_details (success path reads it there directly). - proxy/utils.py: post_call_failure_hook lifts it off the logging object into request_data (internal top-level key, same convention as the other proxy-internal request_data keys) right before the existing litellm_logging_obj pop. Never touches user metadata. - opentelemetry.py: read the anchor from the container top level (model_call_details on success, request_data on failure). - Tests updated; add TestPostCallFailureHookLiftsFirstApiCallStartTime. Fixes the batches_testing regression introduced on this branch. * chore(otel): trim verbose comments to concise rationale Collapse multi-line why-blocks to one or two lines and drop process/plan references (PR-numbering, "the plan") from test comments. No behavior change.
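The preprocessing-duration computation described in the commits above (both anchors resolved, early return when either is missing) could be sketched as below. This is not the code from opentelemetry.py: the body of `_to_timestamp`, the name `preprocessing_duration_ms`, and the choice to return None for a negative window are all assumptions made for illustration.

```python
from datetime import datetime
from typing import Optional, Union

Timestamp = Union[datetime, float, int, None]


def _to_timestamp(value: Timestamp) -> Optional[float]:
    """Assumed behavior: accept a datetime or epoch seconds, else None."""
    if value is None:
        return None
    if isinstance(value, datetime):
        return value.timestamp()
    return float(value)


def preprocessing_duration_ms(
    received_at: Timestamp, first_api_call_start: Timestamp
) -> Optional[float]:
    """Milliseconds from proxy-receive to the first provider handoff."""
    start = _to_timestamp(received_at)
    end = _to_timestamp(first_api_call_start)
    # Resolve both anchors first so no possibly-None value is subtracted.
    if start is None or end is None:
        return None
    delta_ms = (end - start) * 1000.0
    # Assumption: a negative window means bad anchors, so report nothing.
    return delta_ms if delta_ms >= 0 else None
```

Because the second anchor is the set-once first handoff, a retry that later overwrites api_call_start_time would not change the value this sketch computes.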

openai 2.34.0 began rejecting an explicitly-passed empty-string api_key at client construction (raises OpenAIError before any request), which broke tests/local_testing/test_exceptions.py::test_exception_with_headers and related cases after uv.lock floated openai 2.33.0 -> 2.36.0. Pin back to 2.33.0 (within the existing pyproject >=2.20.0,<3.0.0 range) as a temporary stopgap; longer-term fix to follow.

) * feat(model_catalog): add Azure AI Foundry GPT-5.4 model metadata Register azure_ai GPT-5.4 variants with pricing, context limits from Foundry catalog, and capability flags for cost routing and tooling. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(model_catalog): tighten Azure AI GPT-5.4 cost and capability metadata Add supports_web_search for base GPT-5.4 aliases, priority-tier Pro rates, and mini/nano above-272k plus priority pricing for correct spend math. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(model_catalog): sync web_search flag on Azure AI GPT-5.4 dated backup row Mirror supports_web_search for azure_ai/gpt-5.4-2026-03-05 in the backup catalog so it matches model_prices_and_context_window.json. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>

…28090) The proxy SERVER span ("Received Proxy Server Request") only carried http.response.status_code on failures (set in _record_exception_on_span), so success traces had no 2xx bucket — error-ratio and status-breakdown dashboards were missing their denominator and the span violated the HTTP semconv (the attribute is required whenever a response is sent). Add a set_response_status_code_attribute helper and call it from async_post_call_success_hook with 200, symmetric with the failure path and the existing route/preprocessing-duration SERVER-span attributes.
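A helper of the kind named above might look roughly like the following. The signature is an assumption (the source names `set_response_status_code_attribute` but does not show it), and `RecordingSpan` is a hypothetical test double standing in for a real OTel span, which exposes the same `set_attribute(key, value)` method.

```python
from typing import Any, Dict, Optional, Protocol


class SpanLike(Protocol):
    """Minimal span surface this sketch relies on."""

    def set_attribute(self, key: str, value: Any) -> None: ...


def set_response_status_code_attribute(
    span: Optional[SpanLike], status_code: int
) -> None:
    """Stamp the HTTP semconv status attribute; no-op when there is no span."""
    if span is None:
        return
    span.set_attribute("http.response.status_code", int(status_code))


class RecordingSpan:
    """Hypothetical test double that records attributes in a dict."""

    def __init__(self) -> None:
        self.attributes: Dict[str, Any] = {}

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value
```

Calling it with 200 from the success hook would give success traces the same attribute that the failure path already sets, so dashboards get a 2xx bucket.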

….20.2) (#28095)

…fo (#28079) * fix(proxy): sort BYOK models by team_public_model_name in /v2/model/info Team BYOK rows persist an internal `model_name` like `model_name_{team_id}_{uuid}` and expose the user-facing name via `model_info.team_public_model_name`. The UI's `getDisplayModelName` and the search filter already fall back to that field, but `_sort_models` was keying off the raw `model_name` — so BYOK rows ranked by their opaque IDs and clumped at the end of the alphabetized list instead of interleaving with non-BYOK rows. Match the UI/search behavior: prefer `team_public_model_name` when present, fall back to `model_name` otherwise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(proxy): case-insensitive DB-side search for BYOK models `_apply_search_filter_to_models` used Prisma's JSON path `string_contains` to match the BYOK `team_public_model_name` field, but that operator is case-sensitive in Postgres (no `mode: insensitive` flag like column-level string filters have). So a search for "claude" missed a stored "Claude Sonnet" via the DB branch even though the router-side path matched it case-insensitively. Widen the JSON branch to "row has a team_public_model_name set" and filter case-insensitively in Python so DB-only BYOK rows match the same terms users see in the UI. This also drops the now-unused DB-level page-size optimization and `sort_by` knob — the in-Python filter is the source of truth for `db_models_total_count` now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(proxy): scope BYOK search results to caller's accessible teams `_apply_search_filter_to_models` was widened to fetch every row with a `team_public_model_name` set so case-insensitive search could match mixed-case stored names. 
`/v2/model/info` is reachable by non-admin keys though, and the helper ran before `include_team_models` / `teamId` filtering — so a non-admin caller could search a common substring like "claude" and see BYOK rows belonging to teams they're not a member of. Resolve the caller's team membership once (admin → no scoping, else their `user_row.teams`) and drop BYOK rows (those with `model_info.team_id` set) outside that scope on both the router-side matches and the over-broad DB query, before display-name matching. Non-team rows are unaffected and remain gated by the existing `include_team_models` / `direct_access` paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(proxy): search by team_public_model_name and scope teamId queries - /v2/model/info search now matches both `model_name` and `model_info.team_public_model_name`, so team BYOK rows (which persist an internal `model_name_{team_id}_{uuid}`) are findable by the public name shown in the UI. DB query OR-includes a JSON-path match on `team_public_model_name` for rows that exist only in the DB. - `_filter_models_by_team_id` no longer short-circuits on the viewer's `direct_access` flag — that describes the admin viewer's own permissions and would leak every public model into a team-scoped view. Models are kept only when they belong to the team (own BYOK, in access_via_team_ids, or reachable via team.models / access groups). - Added `_authorize_team_id_query`: the untrusted `teamId` query parameter now requires the caller to be a proxy admin or a member of the requested team, otherwise returns 403. Without this, any authenticated user could enumerate another team's BYOK metadata by guessing the team id. - `_get_caller_byok_team_scope` now treats `PROXY_ADMIN_VIEW_ONLY` the same as `PROXY_ADMIN` (both are admin roles); previously VIEW_ONLY admins fell through to a user-id team lookup and saw only their own teams' BYOK rows. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(proxy): bound BYOK search DB fetch in /v2/model/info Previously the DB-side search OR'd a JSON-path predicate `{model_info: {path: [team_public_model_name], string_contains: ""}}` to compensate for Prisma's case-sensitive JSON `string_contains` on Postgres. That predicate matches every row that has any `team_public_model_name` set, so any authenticated caller could force a full BYOK-table read with `/v2/model/info?search=x` regardless of page size. Drop the JSON-path branch. The DB query now does a bounded `model_name contains <search>` lookup. BYOK rows that are loaded into the router are still searchable by their `team_public_model_name` via the router-side filter; only the rare edge case of a BYOK row that exists only in the DB (router sync failed) loses display-name search, which is an acceptable trade-off given the DoS surface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(proxy): bound DB find_many in /v2/model/info search The previous bounding patch dropped the page-aware `take=N` on `find_many`, so a broad `?search=model` would load and decrypt every matching DB row on each request even though the response only returns one page. Restore bounded fetches in `_apply_search_filter_to_models`: * Unsorted searches use `take = max(0, page * size - router_count)`, i.e. exactly one page worth of remaining DB rows. * Sorted searches need ordering across the full match set, so they cap at `_SORTED_SEARCH_DB_FETCH_CAP = 500` instead of fetching everything. * Total count comes from a cheap `count(...)` query so pagination stays accurate without materializing every row. Wired `page`, `size`, and `sortBy` through from the endpoint and added a regression test covering both `take` values. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(proxy): extract DB-fetch helper to satisfy PLR0915 _apply_search_filter_to_models tripped Ruff's "too many statements" (51 > 50) after the bounded-fetch fix. Move the DB-side block into `_fetch_db_models_for_search`, which keeps the same behavior: * Bounded `take` via page math (unsorted) or `_SORTED_SEARCH_DB_FETCH_CAP` (sorted) * Cheap `count(...)` for accurate pagination totals * Caller-team scope applied to fetched rows before decrypt Pure refactor; no behavior change. All 8 BYOK/team tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style: apply black formatting to _fetch_db_models_for_search CI's "Check Black formatting" step flagged one line in the helper added in d55eecf. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
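The bounded-fetch rule from the /v2/model/info commits above can be written out as a small function. This is a sketch of the arithmetic only: the function name `db_take_for_search` and its parameters are hypothetical, while the formula and the 500 cap are taken from the commit message.

```python
from typing import Optional

_SORTED_SEARCH_DB_FETCH_CAP = 500


def db_take_for_search(
    page: int, size: int, router_count: int, sort_by: Optional[str]
) -> int:
    """How many DB rows to request for one search page."""
    if sort_by:
        # Sorting needs the wider match set, so use the fixed cap.
        return _SORTED_SEARCH_DB_FETCH_CAP
    # Unsorted: only the rows still needed after router-side matches.
    return max(0, page * size - router_count)
```

For example, page 1 with size 50 and 20 router-side matches asks the DB for 30 rows, and asks for none once the router matches already fill the page.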

- Add AWS ECS Fargate stack with Aurora Postgres (IAM auth), ElastiCache Redis, S3, ALB with path-based routing to gateway/backend/ui components, Application Auto Scaling, and automated DB bootstrap + prisma migration via local-exec provisioners
- Add GCP Cloud Run stack with Cloud SQL Postgres (password auth), Memorystore Redis, GCS, external HTTPS load balancer with serverless NEGs and URL map routing, and automated prisma migration via Cloud Run Job
- Both stacks support typed proxy_config input mirroring the helm chart's gateway.config.proxy_config, per-component extra env vars, and Secret Manager references for provider API keys
- Gateway/backend services depend on terraform_data.migration so they never start before the schema is in place, eliminating crash-loop windows on first apply
- AWS stack uses IAM database authentication with a one-shot Fargate bootstrap task that creates and grants the rds_iam role to the application user; GCP stack uses password auth assembled at container startup to avoid Cloud SQL Auth Proxy sidecar complexity
- Add .gitignore rules for Terraform state files, plan files, tfvars inputs, provider binaries, and crash logs while explicitly keeping .terraform.lock.hcl for provider version pinning
- Include terraform.tfvars.example files, provider lock files, and comprehensive README documentation covering architecture, TLS setup, image pull strategies, and quick-start instructions for both stacks
Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>

…{"detail":"invalid_request"} (#28086)

* fix(mcp-oauth): add PROXY_BASE_URL escape hatch + diagnostic logging for invalid_request

  Customers hitting "{"detail":"invalid_request"}" on the MCP /authorize endpoint had no way to recover when their ingress mangles X-Forwarded-* headers (the same-origin check in validate_trusted_redirect_uri compares the browser-supplied redirect_uri against get_request_base_url, which is reconstructed from those headers).

  Two contained changes:

  1. get_request_base_url now honours PROXY_BASE_URL as the canonical public origin when set, bypassing the X-Forwarded-* trust gate entirely. Operators who know their public URL can set it once instead of debugging ingress header rewrites.
  2. The rejection path in validate_trusted_redirect_uri emits a WARN log carrying the redirect_uri, computed proxy base, and the X-Forwarded-* / Host headers seen. A bare 400 was undiagnosable; this turns it into a one-line root-cause.

* test(mcp-oauth): capture warnings from correct logger ("LiteLLM")

  Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(mcp-oauth): reject malformed PROXY_BASE_URL with one-shot diagnostic

  A scheme-less PROXY_BASE_URL (e.g. "litellm.example.com" instead of "https://litellm.example.com") would sail through urlparse with empty scheme + netloc, silently breaking every same-origin compare in validate_trusted_redirect_uri and leaving the operator staring at the same opaque 400 the env var was meant to fix. Validate it once at read time: only honour values that parse as http(s) URLs with a non-empty netloc; otherwise log a one-shot WARN naming the bad value and fall through to the request-derived origin so the proxy still serves traffic.

* fix(mcp/oauth): normalize PROXY_BASE_URL to strip query/fragment

  Match the X-Forwarded-* path's normalization so a configured PROXY_BASE_URL containing a query string or fragment does not break downstream f-string concatenation like f"{base_url}/callback".

  Co-authored-by: Yassin Kortam <yassin@berri.ai>

* refactor(mcp-oauth): drop non-essential comments from PROXY_BASE_URL changes

  Strip narrative comments and verbose docstrings added in this PR; the code is intuitive enough on its own and the log messages already carry their own diagnostic context. Pre-existing comments are left untouched.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
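
The PROXY_BASE_URL handling described in the commits above (validate once at read time, warn once on a bad value, strip query/fragment, fall back to the request-derived origin) could look roughly like the sketch below. This is not the code from the PR: `read_proxy_base_url`, `resolve_base_url`, and the `_warned_bad_values` set are hypothetical names chosen for illustration, and stripping a trailing slash is an extra assumption made here so that `f"{base_url}/callback"` never produces a double slash.

```python
import logging
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit, urlunsplit

# The commit log mentions warnings being emitted on the "LiteLLM" logger.
logger = logging.getLogger("LiteLLM")

# Hypothetical one-shot guard: remember bad values we already warned about.
_warned_bad_values: set = set()


def read_proxy_base_url(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a normalized PROXY_BASE_URL, or None if unset or malformed."""
    source = os.environ if env is None else env
    raw = (source.get("PROXY_BASE_URL") or "").strip()
    if not raw:
        return None

    parts = urlsplit(raw)
    # A scheme-less value such as "litellm.example.com" parses with an
    # empty scheme and netloc, so it is rejected here.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        if raw not in _warned_bad_values:
            _warned_bad_values.add(raw)
            logger.warning(
                "Ignoring malformed PROXY_BASE_URL=%r; expected an http(s) URL",
                raw,
            )
        return None

    # Drop query and fragment. Removing the trailing slash is an assumption
    # of this sketch, not something the commit message states.
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))


def resolve_base_url(
    request_derived_origin: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Prefer a valid PROXY_BASE_URL, else fall back to the request origin."""
    return read_proxy_base_url(env) or request_derived_origin
```

With this shape, a misconfigured value degrades to the pre-existing behavior (the origin reconstructed from request headers) instead of breaking every same-origin comparison, which matches the fall-through described in the commit message.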

* bump: version 0.1.40 → 0.1.41
* bump: version 1.85.0 → 1.86.0
* add uv lock

greptile-apps · 2026-05-17T01:32:21Z

Too many files changed for review. (625 files found, 100 file limit)

CLAassistant · 2026-05-17T01:32:27Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
9 out of 15 committers have signed the CLA.

✅ vladpolevoi
✅ ryan-crabbe-berri
✅ dennishenry
✅ mateo-berri
✅ yuneng-berri
✅ shivamrawat1
✅ Michael-RZ-Berri
✅ Sameerlite
✅ kenany
❌ claude
❌ krrish-berri-2
❌ yassin-berriai
❌ ishaan-berri
❌ shin-berri
❌ cursoragent

mateo-berri

LGTM

codspeed-hq · 2026-05-17T01:33:51Z

Merging this PR will not alter performance

✅ 16 untouched benchmarks

Comparing litellm_internal_staging (cf9b5e4) with main (e58a561)

codecov · 2026-05-17T01:35:02Z

Codecov Report

❌ Patch coverage is 90.38462% with 55 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
litellm/proxy/_experimental/mcp_server/server.py	75.47%	13 Missing ⚠️
litellm/integrations/opentelemetry.py	90.21%	9 Missing ⚠️
...llm/integrations/websearch_interception/handler.py	88.57%	8 Missing ⚠️
litellm/llms/custom_httpx/llm_http_handler.py	0.00%	8 Missing ⚠️
litellm/cost_calculator.py	14.28%	6 Missing ⚠️
litellm/integrations/prometheus.py	88.23%	4 Missing ⚠️
...erimental/mcp_server/auth/user_api_key_auth_mcp.py	93.02%	3 Missing ⚠️
litellm/integrations/custom_logger.py	50.00%	1 Missing ⚠️
...integrations/opentelemetry_utils/gen_ai_semconv.py	98.50%	1 Missing ⚠️
...tellm/integrations/websearch_interception/tools.py	83.33%	1 Missing ⚠️
... and 1 more

cursoragent and others added 30 commits May 13, 2026 00:31

fix(vcr): cover full RFC1918 172.16.0.0/12 range in local prefixes

ba0f4c1

Merge remote-tracking branch 'origin/litellm_internal_staging' into litellm_vcr-cache-observability-and-fixes-c5bc

a8f1940

fix(guardrails): improve CrowdStrike AIDR input handling (#26658)

649eb2d

Merge pull request #27795 from BerriAI/litellm_vcr-cache-observability-and-fixes-c5bc

16bd819

test(vcr): classify cache verdicts, surface cost leaks, and fix the two biggest leakers

Merge remote-tracking branch 'origin/litellm_internal_staging' into litellm_/funny-williams-dab711

d4be90c

Merge pull request #27957 from BerriAI/litellm_/blissful-hawking-dfd43c

27de654

chore(ci): remove unused GitHub Actions workflows and orphan files

Merge remote-tracking branch 'origin/litellm_internal_staging' into litellm_/peaceful-jang-c0e43b

5176e22

fix(mcp): allow delegate PKCE bypass for internal MCP servers

5aabfcc

Remove available_on_public_internet gating from delegate-auth-to-upstream paths so oauth2 + delegate_auth_to_upstream interactive servers behave the same when marked internal. Keeps M2M exclusion. Updates tests.

Merge pull request #27958 from BerriAI/litellm_/funny-williams-dab711

bcbae93

test(ui): preserve global Button/Tooltip mocks in per-file @tremor/react vi.mock

yuneng-berri and others added 23 commits May 15, 2026 22:34

Merge remote-tracking branch 'origin/litellm_internal_staging' into litellm_/nostalgic-johnson-eeb7c3

2130bdc

Merge pull request #28039 from BerriAI/litellm_/nostalgic-johnson-eeb7c3

ec2f3aa

test(gemini): de-flake test_gemini_image_size_limit_exceeded

refactor: strip PR-introduced docstrings and explanatory comments

f9485f1

Merge pull request #28036 from BerriAI/litellm_grid-v4-e2e-tests-cZRwz

57e5e4a

test(ci): add reasoning_effort grid e2e regression suite

chore: update Next.js build artifacts (2026-05-16 22:22 UTC, node v20.20.2) (#28095)

cd551f6

[Infra] Bump versions (#28094)

cf9b5e4

* bump: version 0.1.40 → 0.1.41
* bump: version 1.85.0 → 1.86.0
* add uv lock

mateo-berri approved these changes May 17, 2026

shin-berri approved these changes May 17, 2026

yuneng-berri merged commit a72414a into main May 17, 2026
110 of 131 checks passed

yuneng-berri merged 82 commits into main from litellm_internal_staging

16 participants
