Skip to content

fix(managed_batches): convert raw output_file_id to managed ID in CheckBatchCost poller#27984

Merged
mateo-berri merged 4 commits into
litellm_internal_stagingfrom
litellm_fix-managed-batch-raw-output-file-id
May 15, 2026
Merged

fix(managed_batches): convert raw output_file_id to managed ID in CheckBatchCost poller#27984
mateo-berri merged 4 commits into
litellm_internal_stagingfrom
litellm_fix-managed-batch-raw-output-file-id

Conversation

@Sameerlite

@Sameerlite Sameerlite commented May 15, 2026

Copy link
Copy Markdown
Collaborator

Problem

LiteLLM_ManagedObjectTable.file_object can be written with a raw provider output_file_id (e.g. file-b83bb643-...) instead of a LiteLLM managed base64 ID. When a client later calls GET /files/{output_file_id}/content using this raw ID, the proxy cannot recover the Azure/model routing context and falls back to default OpenAI credentials, producing:

The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable

Root Cause

CheckBatchCost (the background enterprise poller) writes the completed batch's file_object directly to the DB without going through async_post_call_success_hook, which is the HTTP path responsible for converting raw provider file IDs to managed base64 IDs.

Fix

In check_batch_cost.py, after resolving the deployment and provider, retrieve the managed_files hook and convert any raw output_file_id / error_file_id on the response object to managed base64 IDs (and register the mapping in litellm_managedfiletable) before the model_dump_json() DB write.

Test

Added test_raw_output_file_id_converted_to_managed_id to the existing test_check_batch_cost.py suite. It verifies that get_unified_output_file_id and store_unified_file_id are called with the correct arguments and that response.output_file_id is updated to the managed ID before the DB write.

Screenshot 2026-05-15 at 2 37 14 PM

Note

Medium Risk
Touches batch polling and managed-file ID translation used for later /files/{id}/content access, so a mistake could break batch retrieval/file downloads. Changes are localized and covered by new unit/regression tests.

Overview
Fixes the CheckBatchCost background poller to translate raw provider output_file_id/error_file_id values into managed base64 IDs (via the managed_files hook) and store the mapping before writing the batch file_object back to LiteLLM_ManagedObjectTable.

Adds a focused unit test asserting the ID conversion + store_unified_file_id registration, and updates the managed-files access regression test setup to include team_id and the new hook calls.

Reviewed by Cursor Bugbot for commit e8f2324. Bugbot is set up for automated code reviews on this repo. Configure here.

…ckBatchCost poller

CheckBatchCost bypasses async_post_call_success_hook, causing raw provider
output_file_ids to be persisted in LiteLLM_ManagedObjectTable. This fix converts
output_file_id and error_file_id to managed base64 IDs before the DB write.

Co-authored-by: Cursor <cursoragent@cursor.com>
@codecov

codecov Bot commented May 15, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@greptile-apps

greptile-apps Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Fixes the CheckBatchCost background poller so that raw provider output_file_id / error_file_id values are converted to managed base64 IDs (and registered in litellm_managedfiletable) before the batch file_object is written back to LiteLLM_ManagedObjectTable, preventing later GET /files/{id}/content calls from falling through to default OpenAI credentials.

  • check_batch_cost.py: After resolving deployment_info and model_name, retrieves the managed_files hook and, for each of output_file_id / error_file_id, calls get_unified_output_file_id then store_unified_file_id (in that order) before setattr-ing the managed ID onto the response; failures are caught and warned without blocking the DB write.
  • test_check_batch_cost.py: New test_raw_output_file_id_converted_to_managed_id covers both file-ID branches with a non-None error_file_id, asserts correct call counts and model_mappings values, and verifies both attributes are updated before the DB flush.
  • test_managed_files_access_check.py: Minimal stub additions (team_id, AsyncMock for store_unified_file_id, return_value for get_unified_output_file_id) prevent the existing regression test from failing on the new code path while leaving its original assertions intact.

Confidence Score: 5/5

Safe to merge — the change is localized to the batch cost poller, adds an opt-out conversion path that silently warns on failure, and does not alter existing behavior when the managed_files hook is absent.

The store_unified_file_id call correctly precedes the setattr, so a transient DB error leaves the response with its original raw file ID rather than a dangling managed ID. The idempotency guard prevents double-conversion. The new test covers both output_file_id and error_file_id branches, and the existing regression test's core assertions are unchanged.

No files require special attention.

Important Files Changed

Filename Overview
enterprise/litellm_enterprise/proxy/common_utils/check_batch_cost.py Adds post-completion conversion of raw provider file IDs to managed base64 IDs in the batch cost poller; ordering is correct (store before setattr), and the guard against double-conversion is in place.
tests/proxy_unit_tests/test_check_batch_cost.py Adds a focused regression test covering both output_file_id and error_file_id conversion; side_effect ordering for _is_base64_encoded_unified_file_id is consistent with the four production call sites; existing fixture now stubs get_proxy_hook to return None so pre-existing tests are unaffected.
tests/test_litellm/enterprise/proxy/test_managed_files_access_check.py Minimal additions to existing test: team_id stub, AsyncMock for store_unified_file_id, and a return_value for get_unified_output_file_id — all required to prevent the new code path from raising when the hook is present; no assertions weakened.

Reviews (3): Last reviewed commit: "fix managed file test" | Re-trigger Greptile

Comment thread tests/proxy_unit_tests/test_check_batch_cost.py Outdated

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using high mode and found 2 potential issues.

Fix All in Cursor

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: setattr before store creates orphaned managed file ID
    • Moved the setattr(response, _file_attr, _unified_file_id) call to after the await managed_files_hook.store_unified_file_id(...) call so the response only retains the managed ID when the DB mapping has been successfully persisted.
  • ✅ Fixed: Missing team_id breaks team-level file access control
    • Populated team_id on the minimal UserAPIKeyAuth from job.team_id so the managed file record is created with the correct team ownership for team-level access checks.
Preview (b99b935b2f)
diff --git a/enterprise/litellm_enterprise/proxy/common_utils/check_batch_cost.py b/enterprise/litellm_enterprise/proxy/common_utils/check_batch_cost.py
--- a/enterprise/litellm_enterprise/proxy/common_utils/check_batch_cost.py
+++ b/enterprise/litellm_enterprise/proxy/common_utils/check_batch_cost.py
@@ -300,6 +300,42 @@
                     custom_llm_provider=custom_llm_provider,
                 )
 
+                # CheckBatchCost bypasses async_post_call_success_hook, so convert raw
+                # output/error file IDs to managed base64 IDs before the DB write here.
+                managed_files_hook = self.proxy_logging_obj.get_proxy_hook("managed_files")
+                if managed_files_hook is not None:
+                    from litellm.proxy._types import UserAPIKeyAuth
+                    _minimal_auth = UserAPIKeyAuth(
+                        user_id=job.created_by or "default-user-id",
+                        team_id=getattr(job, "team_id", None),
+                    )
+                    for _file_attr in ["output_file_id", "error_file_id"]:
+                        _raw_file_id = getattr(response, _file_attr, None)
+                        if _raw_file_id and not _is_base64_encoded_unified_file_id(_raw_file_id):
+                            try:
+                                _unified_file_id = managed_files_hook.get_unified_output_file_id(
+                                    output_file_id=_raw_file_id,
+                                    model_id=model_id,
+                                    model_name=str(model_name) if model_name else deployment_info.model_name or None,
+                                )
+                                await managed_files_hook.store_unified_file_id(
+                                    file_id=_unified_file_id,
+                                    file_object=None,
+                                    litellm_parent_otel_span=None,
+                                    model_mappings={model_id: _raw_file_id},
+                                    user_api_key_dict=_minimal_auth,
+                                )
+                                setattr(response, _file_attr, _unified_file_id)
+                                verbose_proxy_logger.info(
+                                    f"CheckBatchCost: converted {_file_attr} "
+                                    f"{_raw_file_id!r} -> managed ID for batch {batch_id}"
+                                )
+                            except Exception as _e:
+                                verbose_proxy_logger.warning(
+                                    f"CheckBatchCost: failed to create managed file ID for "
+                                    f"{_file_attr}={_raw_file_id!r}: {_e}"
+                                )
+
                 # Pass deployment model_info so custom batch pricing
                 # (input_cost_per_token_batches etc.) is used for cost calc
                 deployment_model_info = deployment_info.model_info.model_dump() if deployment_info.model_info else {}

diff --git a/tests/proxy_unit_tests/test_check_batch_cost.py b/tests/proxy_unit_tests/test_check_batch_cost.py
--- a/tests/proxy_unit_tests/test_check_batch_cost.py
+++ b/tests/proxy_unit_tests/test_check_batch_cost.py
@@ -22,7 +22,9 @@
 
     @pytest.fixture
     def mock_proxy_logging_obj(self):
-        return MagicMock()
+        mock = MagicMock()
+        mock.get_proxy_hook.return_value = None
+        return mock
 
     @pytest.fixture
     def mock_llm_router(self):
@@ -372,3 +374,122 @@
             update_data["batch_processed"] is True
         ), "update() must include batch_processed=True when column is present"
         assert update_data["status"] == "complete"
+
+    @pytest.mark.asyncio
+    async def test_raw_output_file_id_converted_to_managed_id(
+        self, check_batch_cost_instance, mock_prisma_client, mock_llm_router
+    ):
+        """CheckBatchCost must convert a raw provider output_file_id to a managed base64 ID.
+
+        Without this, GET /batches/{id} returns a raw file ID that cannot be routed
+        through the proxy, causing API_KEY errors when clients call GET /files/{id}/content.
+        """
+        mock_prisma_client.db.litellm_managedobjecttable.update_many = AsyncMock(
+            return_value=0
+        )
+        mock_prisma_client.db.litellm_managedobjecttable.update = AsyncMock()
+        mock_prisma_client.db.litellm_usertable.find_unique = AsyncMock(
+            return_value=None
+        )
+
+        mock_job = MagicMock()
+        mock_job.id = "job-raw-file-1"
+        mock_job.unified_object_id = "dW5pZmllZF9iYXRjaF9pZA=="
+        mock_job.created_by = "user-1"
+
+        check_batch_cost_instance._has_batch_processed_column = True
+        mock_prisma_client.db.litellm_managedobjecttable.find_many = AsyncMock(
+            return_value=[mock_job]
+        )
+
+        raw_output_file_id = "file-batch-output-abc123"
+        fake_managed_id = "bGl0ZWxsbV9wcm94eTo6bWFuYWdlZA=="
+
+        mock_response = MagicMock()
+        mock_response.status = "completed"
+        mock_response.output_file_id = raw_output_file_id
+        mock_response.error_file_id = None
+        mock_response.model_dump_json.return_value = (
+            '{"id":"batch-1","status":"completed"}'
+        )
+
+        mock_llm_router.aretrieve_batch = AsyncMock(return_value=mock_response)
+        mock_llm_router.get_deployment_credentials_with_provider = MagicMock(
+            return_value={"api_key": "sk-test"}
+        )
+
+        mock_deployment = MagicMock()
+        mock_deployment.litellm_params.custom_llm_provider = "azure"
+        mock_deployment.litellm_params.model = "azure/gpt-5-mini"
+        mock_deployment.model_name = "gpt-5-batch"
+        mock_deployment.model_info.model_dump.return_value = {}
+        mock_llm_router.get_deployment = MagicMock(return_value=mock_deployment)
+
+        mock_hook = MagicMock()
+        mock_hook.get_unified_output_file_id.return_value = fake_managed_id
+        mock_hook.store_unified_file_id = AsyncMock()
+        check_batch_cost_instance.proxy_logging_obj.get_proxy_hook.return_value = (
+            mock_hook
+        )
+
+        mock_file_content = MagicMock()
+        mock_file_content.content = b'{"id":"req-1"}'
+        decoded_id = "llm_model_id,model-123;llm_batch_id,batch-456;"
+
+        with (
+            patch(
+                "litellm.proxy.openai_files_endpoints.common_utils._is_base64_encoded_unified_file_id",
+                # call 1: job unified_object_id decode, call 2: existing raw check,
+                # call 3: fix's guard for output_file_id
+                side_effect=[decoded_id, None, None],
+            ),
+            patch(
+                "litellm.proxy.openai_files_endpoints.common_utils.get_model_id_from_unified_batch_id",
+                return_value="model-123",
+            ),
+            patch(
+                "litellm.proxy.openai_files_endpoints.common_utils.get_batch_id_from_unified_batch_id",
+                return_value="batch-456",
+            ),
+            patch(
+                "litellm.files.main.afile_content",
+                new_callable=AsyncMock,
+                return_value=mock_file_content,
+            ),
+            patch(
+                "litellm.batches.batch_utils._get_file_content_as_dictionary",
+                return_value=[{"id": "req-1"}],
+            ),
+            patch(
+                "litellm.batches.batch_utils.calculate_batch_cost_and_usage",
+                new_callable=AsyncMock,
+                return_value=(
+                    0.01,
+                    {"prompt_tokens": 10, "completion_tokens": 5},
+                    ["gpt-4"],
+                ),
+            ),
+            patch(
+                "litellm.litellm_core_utils.get_llm_provider_logic.get_llm_provider",
+                return_value=("gpt-5-mini", "azure", None, None),
+            ),
+            patch(
+                "litellm.litellm_core_utils.litellm_logging.Logging"
+            ) as mock_logging_cls,
+        ):
+            mock_logging_obj = MagicMock()
+            mock_logging_obj.async_success_handler = AsyncMock()
+            mock_logging_cls.return_value = mock_logging_obj
+
+            await check_batch_cost_instance.check_batch_cost()
+
+        mock_hook.get_unified_output_file_id.assert_called_once_with(
+            output_file_id=raw_output_file_id,
+            model_id="model-123",
+            model_name="gpt-5-mini",
+        )
+        mock_hook.store_unified_file_id.assert_awaited_once()
+        store_kwargs = mock_hook.store_unified_file_id.call_args[1]
+        assert store_kwargs["file_id"] == fake_managed_id
+        assert store_kwargs["model_mappings"] == {"model-123": raw_output_file_id}
+        assert mock_response.output_file_id == fake_managed_id

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit 13d34a1. Configure here.

…and propagate team_id

- Move setattr after store_unified_file_id so the response only receives the
  managed ID once the DB record is successfully written. Avoids serializing
  an orphaned managed ID into file_object when the store call fails.
- Populate team_id on the minimal UserAPIKeyAuth from job.team_id so the
  managed file record is created with the correct team ownership, allowing
  other team members to access the batch output file via /files/{id}/content.

Co-authored-by: Yassin Kortam <yassin@berri.ai>
@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ Sameerlite
❌ cursoragent
You have signed the CLA already but the status is still pending? Let us recheck it.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Sameerlite

Copy link
Copy Markdown
Collaborator Author

@greptile re review

@Sameerlite Sameerlite requested a review from mateo-berri May 15, 2026 10:28
@mateo-berri

Copy link
Copy Markdown
Collaborator

@greptileai

@mateo-berri mateo-berri left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; thanks!

@mateo-berri mateo-berri merged commit cbdc70d into litellm_internal_staging May 15, 2026
116 checks passed
@mateo-berri mateo-berri deleted the litellm_fix-managed-batch-raw-output-file-id branch May 15, 2026 11:41
Sameerlite added a commit that referenced this pull request May 22, 2026
* test(vcr): classify cache verdicts, detect live calls, surface cost leaks

Convert the per-test VCR verdict line from a single 'NOOP / HIT / MISS /
PARTIAL' tag into a classified outcome that distinguishes the cases that
silently bill the live API on every CI run from the ones that don't:

  HIT                         pure replay
  PARTIAL                     mixed replay + new recordings
  MISS:RECORDED               new cassette saved to Redis (cached next run)
  MISS:OVERFLOW               cassette > MAX_EPISODES_PER_CASSETTE; persister
                              refused to save; re-bills every run
  MISS:NOT_PERSISTED          test failed; save_cassette skipped; re-bills
  NOOP                        VCR-marked but no HTTP traffic (mocked elsewhere)
  UNMARKED:LIVE_CALL          test bypassed VCR AND opened a TCP connection
                              to a known LLM provider host -> wasted spend
  UNMARKED:NO_TRAFFIC         test bypassed VCR but didn't call out

The UNMARKED:LIVE_CALL signal is what converts 'this test probably hits
live' into 'this test connected to api.openai.com'. We install a
socket.connect / socket.create_connection wrapper for the duration of
each non-VCR-marked test and record any outbound TCP to a known LLM
provider hostname. The probe sits below the httpx layer so vcrpy and
respx (which both patch above the socket) are unaffected.

Replace the file-level _RESPX_CONFLICTING_FILES blacklists in the
llm_translation and local_testing conftests with per-item respx
detection in apply_vcr_auto_marker_to_items. A test now skips VCR when
it actually carries @pytest.mark.respx or has respx_mock in its fixture
chain - not just because some other test in the same file imports
MockRouter. Items skipped by skip_files are split into respx_conflict
(real conflict, the module wires up respx) vs file_opt_out (dead skip-
list entry whose module never touches respx) so the session summary
makes pruning obvious.

Stabilize the AWS SigV4 fingerprint: the Authorization header on
Bedrock requests rotates its Credential date and Signature on every
call, which previously pushed every Bedrock test past the 50-episode
overflow threshold. Extract the access-key id only
('aws-sigv4:AKIA...') so two requests with the same identity match.

Always emit verdict logging when VCR is active (set
LITELLM_VCR_VERBOSE=0 to opt back into the legacy quiet mode). Add a
session-end classification summary that lists overflow tests, unmarked
live-call tests, and the skip-reason breakdown.

Wire the live-call probe + summary hook into every test directory that
already uses the Redis-backed VCR cache (audio_tests, guardrails_tests,
image_gen_tests, litellm_utils_tests, llm_responses_api_testing,
llm_translation, local_testing, logging_callback_tests, ocr_tests,
pass_through_unit_tests, router_unit_tests, search_tests,
unified_google_tests).

Add tests/llm_translation/test_vcr_classification.py covering the
verdict classifier, skip-reason tagging, AWS SigV4 fingerprint stability,
live-host classification, and session summary rendering.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): drop dead 'from respx import MockRouter' imports

These seven test files were on _RESPX_CONFLICTING_FILES, which made the
auto-marker skip them entirely. Inspecting the source shows the only
respx artifact is a top-level 'from respx import MockRouter' that no
test ever uses - no @pytest.mark.respx, no respx_mock fixture, no
respx.mock context manager. The import is dead code left over from a
previous mocking pattern.

Now that apply_vcr_auto_marker_to_items detects respx per-item via the
marker / fixture chain (b637d9f64a), the file-level skip is no longer
needed for these files - they were the reason the OpenAI tests
(test_o3_reasoning_effort, test_streaming_response[o1/o3-mini],
TestOpenAIO1::test_streaming, TestOpenAIChatCompletion::test_web_search,
TestOpenAIO3::test_web_search, etc.) ran live every CI build despite
the cassette cache being healthy.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(image_edits): regenerate fixtures per call instead of holding open module-level file handles

Module-level

    TEST_IMAGES = [
        open(os.path.join(pwd, 'ishaan_github.png'), 'rb'),
        open(os.path.join(pwd, 'litellm_site.png'), 'rb'),
    ]
    SINGLE_TEST_IMAGE = open(...)

opens the file once at import. After the first multipart upload, the
file pointer is at EOF, so every subsequent test in the same xdist
worker sends an empty multipart body. That non-determinism (a) blows
the recorded cassette past MAX_EPISODES_PER_CASSETTE (50) so
_RedisPersister.save_cassette refuses to save it, and (b) re-bills the
live image edit endpoint on every CI run.

Recent CI runs confirm the leak: tests/image_gen_tests/test_image_edits.py
shows six tests parking at 51-52 cassette entries
(TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False],
TestOpenAIImageEditDallE2::..., test_openai_image_edit_with_bytesio,
test_openai_image_edit_litellm_router, test_multiple_vs_single_image_edit[False],
test_multiple_image_edit_with_different_formats).

Replace the module-level file handles with _make_test_images() /
_make_single_test_image() factories that return fresh _RewindableImage
(BytesIO subclass) objects whose pointer always starts at 0. The image
bytes are read once at import into module-level constants
(_ISHAAN_GITHUB_BYTES, _LITELLM_SITE_BYTES), so disk I/O cost is
unchanged.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(vcr): match real Bedrock hostnames in live-call probe

The suffix '.bedrock-runtime.amazonaws.com' never matched real Bedrock
endpoints, which use the format 'bedrock-runtime[-fips].{region}.amazonaws.com'
(region between 'bedrock-runtime' and 'amazonaws.com'). Add an explicit
host check for that pattern so Bedrock live calls are visible to the
probe, and update the unit test accordingly. Also drop the unused
'_LIVE_CALL_PROBE_INSTALLED' module variable.

* fix(vcr): cover full RFC1918 172.16.0.0/12 range in local prefixes

* fix(image_edits): drop _RewindableImage to prevent infinite multipart upload

The _RewindableImage(BytesIO) wrapper auto-rewound on every read after
EOF, which made the OpenAI SDK's multipart upload writer read the same
bytes forever instead of seeing EOF. Workers OOM'd / SIGKILL'd:

    [gw0] node down: Not properly terminated
    replacing crashed worker gw0
    ...
    worker 'gw1' crashed while running
        'tests/image_gen_tests/test_image_edits.py::TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False]'

The auto-rewind was added defensively for parametrized + flaky-retried
tests, but BaseLLMImageEditTest::test_openai_image_edit_litellm_sdk
already calls get_base_image_edit_call_args() once per invocation and
that helper now constructs fresh streams via _make_test_images(), so
rewinding inside the stream is unnecessary. Replace with plain BytesIO
seeded with the cached image bytes.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): mark Bedrock prompt-caching cross-call tests VCR-incompatible

The pass_through prompt-caching tests
(test_prompt_caching_returns_cache_read_tokens_on_second_call,
test_prompt_caching_streaming_second_call_returns_cache_read) make a
warm-up call and then assert the *second* call sees a non-zero
cache_read_input_tokens count from the upstream's prompt-cache. VCR
replay can't model cross-call provider state — both calls match the
same cassette episode, so the second call returns the first call's
pre-warmup response and the assertion fails:

    AssertionError: Expected cache_read_input_tokens > 0 on second call,
    but got 0. Full usage: {'input_tokens': 4986,
    'cache_creation_input_tokens': 4974, 'cache_read_input_tokens': 0}

This started biting after the AWS SigV4 fingerprint stabilization
(b637d9f64a): Bedrock requests now produce a stable per-access-key
fingerprint instead of a per-request signature, so cassettes
successfully replay where they previously always missed and re-recorded
live. Opt these tests out via skip_nodeid_suffixes so they run live and
match the existing pattern in tests/llm_translation/conftest.py
(::test_prompt_caching).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): tighten OVERFLOW classification and switch respx detection to AST

Address two greptile P2 review concerns on PR #27795:

1. MISS:OVERFLOW was firing whenever total > MAX_EPISODES_PER_CASSETTE
   regardless of cassette state. A cassette that grew past the cap
   historically but this run only *replayed* (dirty=False) is
   healthy — the persister never tries to save, so the cache state is
   stable and the next run will replay too. Only flag OVERFLOW when
   dirty=True (new episodes were recorded that the persister would
   refuse to save). Add a regression test covering the
   dirty=False + large-total case.

2. _module_uses_respx did substring matching on the module source,
   which false-positives on comments / docstrings / string literals.
   A comment like # Previously tried respx.mock but switched to
   vcrpy would keep a file pinned on the opt-out list, defeating the
   dead-import pruning goal of this PR. Replace the substring scan
   with an ast.NodeVisitor (_RespxUsageVisitor) that only
   counts:

     - @pytest.mark.respx / @respx.mock decorators
     - with respx.mock(): ... (sync + async) context managers
     - respx.mock(...) calls outside a with/decorator
     - function parameters / fixture names equal to respx_mock

   Add tests for the comment / docstring / string-literal cases plus
   each real-usage pattern.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(vcr): aggregate worker stats on the controller so the session summary actually renders under xdist

`_session_stats` is a module-level dict mutated inside `_vcr_outcome_gate`
— which runs in each xdist worker process. The controller's
`pytest_terminal_summary` then reads its own empty `_session_stats` and
bails on `if not counts: return`, so the OVERFLOW / LIVE_CALL sections
the rest of this PR adds never make it into CI logs in the dist mode CI
actually uses.

Ship a structured `vcr_outcome` payload via `user_properties` (which
xdist round-trips) and add `aggregate_report_outcome` on the controller
to fold worker outcomes into `_session_stats`. The recording process
tags `vcr_recorded_by` with `PYTEST_XDIST_WORKER` so the controller can
tell "single-process — already counted locally" apart from "produced by
a worker — needs aggregation here", and not double-count when there's
no xdist.

Covered by 9 new unit tests in test_vcr_classification.py including the
end-to-end summary render path.

* fix(guardrails): improve CrowdStrike AIDR input handling (#26658)

* feat(lasso): add tool-calling support to LassoGuardrail (#27648)

* feat(lasso): extend LassoGuardrail to support tool calling (RND-5748)

* fix(lasso): PR review followups for tool-calling guardrail (RND-5748)

* fix(lasso): handle object-style tool_calls in _update_tool_calls_from_masked (RND-5748)

* fix(lasso): use model role for tool_use blocks (RND-5748)

* test(lasso): add round-trip tests for message transformation (RND-5748)

* fix(lasso): remove unused imports, handle Responses-API input masking, flatten multimodal content (RND-5748)

* fix(lasso): inspect Responses-API input field (RND-5748)

* fix(lasso): guard text-cursor remap against Lasso count mismatch (RND-5748)

* fix(lasso): flatten list content in tool_result.content (RND-5748)

* fix(lasso): remap multimodal list content during masking (RND-5748)

Bug: _map_masked_messages_back counted list-content messages in
original_text_count but the remap loop only handled isinstance(str).
The positional text_cursor never advanced for list messages, causing
all subsequent masked texts to be written onto the wrong messages.

Fix: added elif isinstance(content, list) branch that replaces the
list with the masked text string and advances the cursor — mirrors
the existing string-content branch. Also handles the assistant +
tool_calls combo for list-content messages.

Test: test_map_masked_messages_back_list_content verifies a user
message with [text + image_url] followed by an assistant message
gets correct masked content on both (cursor stays aligned).

* refactor(lasso): extract _get_field and _extract_tool_call_fields helpers (RND-5748)

The dict-vs-object access pattern (x.get('y') if isinstance(x, dict)
else getattr(x, 'y', None)) was duplicated 14 times across 5 methods.

_get_field(obj, field) — single-point dict/Pydantic field access.
_extract_tool_call_fields(call) — returns (call_id, name, parsed_input)
with JSON argument parsing, replacing ~30 duplicate lines in both
async_post_call_success_hook and _expand_messages_for_classification.

Also simplified _update_tool_calls_from_masked, _prepare_payload tool
mapping, and _apply_masking_to_model_response call_id extraction.

Net ~60 lines removed. No behavior change — all 32 tests pass.

* fix(lasso): add count guard to _apply_masking_to_model_response (RND-5748)

_apply_masking_to_model_response used a bare text_cursor without
verifying 1:1 correspondence between text-bearing choices and masked
text entries. If Lasso returned a different number of text messages
than choices with content, masked text would be applied to the wrong
choice or silently skip choices.

Added the same count-mismatch guard pattern already used in
_map_masked_messages_back: count original text-bearing choices,
compare to masked_text length, skip text remap on mismatch with a
warning log. Tool_call masking via id-based lookup is unaffected.

Tests:
- test_apply_masking_to_model_response_multiple_choices: verifies
  correct per-choice masked text with 2 choices
- test_apply_masking_to_model_response_count_mismatch: verifies
  content is left unchanged when counts disagree

* fix(lasso): close two guardrail-bypass paths flagged in review (RND-5748)

* tool-call args: when function.arguments is malformed JSON or parses
  to a non-object, preserve the raw string as {"arguments": <raw>} so
  Lasso still inspects it instead of receiving input=None. Covers both
  pre-call and post-call extraction (shared helper). Also resolves the
  CodeQL empty-except warning since the except body now assigns parsed=None.
* Responses-API input: when a request carries both "messages" and
  "input", inspect both. Previously a benign messages array let the
  guardrail skip data["input"] entirely. The masking write-back is
  split via a count boundary so masked messages flow back to
  data["messages"] and masked input flows back to data["input"]
  without cross-contamination.

Tests: malformed/non-object args round-trip, dual-field classification,
dual-field masking write-back split.

* chore(lasso): black formatting + comment on expand skip branch (RND-5748)

* black: wrap two long expressions in lasso.py and reformat dict
  literals in test_lasso.py to satisfy CI lint.
* add a short comment in _expand_messages_for_classification
  explaining why empty string and None content are intentionally
  skipped (None is the OpenAI shape for a pure tool-call turn).

* fix(lasso): satisfy mypy in _handle_masking, _update_tool_calls_from_masked, _apply_masking_to_model_response (RND-5748)

* Narrow `response.get("messages")` into a local before slicing so
  mypy doesn't see `Optional[List[Dict[str, str]]]` as non-indexable.
* Rename the two write-side `func` bindings in
  `_update_tool_calls_from_masked` to `func_dict` / `func_obj` so
  mypy doesn't unify the dict and Any|None branches.
* Rename the inner loop variable in `_apply_masking_to_model_response`
  from `msg` to `masked_msg` to avoid clashing with the
  `msg = choice.message` rebinding below.

No behavior change; resolves the 7 mypy errors from the CI lint job.

* perf: eliminate per-request callback scanning on proxy hot path (#27858)

- Introduce `_CallbackCapabilities` dataclass and `ProxyLogging._callback_capabilities()` static method that inspects `litellm.callbacks` once and caches capability flags keyed on (list length, member ids); invalidates automatically when the callback list mutates without per-request iteration overhead
- Replace O(n) `litellm.callbacks` walks in `async_pre_call_hook`, `during_call_hook`, `async_post_call_streaming_iterator_hook`, `async_post_call_streaming_hook`, and `post_call_response_headers_hook` with fast-path exits when no relevant callbacks are registered
- Add `needs_iterator_wrap()` and `needs_per_chunk_streaming_hook()` instance methods to decouple iterator-level wrapping from per-chunk hook execution; avoids `get_response_string` materialization per chunk when no guardrail or chunk-hook callback is active
- Introduce `_fast_serialize_simple_model_response_stream()` using `orjson` for common single-choice text streaming chunks, bypassing the full Pydantic serializer; falls back to `model_dump_json` for tool calls, logprobs, usage, and provider-specific fields
- Add early-return in `_restamp_streaming_chunk_model` when downstream model already matches the requested model, avoiding unnecessary string comparisons on every chunk
- Fix stale zero-cost cache bug in `_is_model_cost_zero`: move the per-router `_zero_cost_cache` dict onto the `Router` instance and clear it in `_invalidate_model_group_info_cache` so in-place pricing updates via `upsert_deployment` immediately resume budget enforcement
- Add `scripts/benchmark_chat_completions_perf.py`: standalone async benchmarking tool with a mock OpenAI provider, LiteLLM proxy process management, non-streaming RPS, streaming TTFT, and full-stream latency measurements with repeat/median run support
- Add comprehensive unit tests covering capability detection, cache invalidation, fast-path correctness, zero-cost cache regression, and the no-callback streaming fast path

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>

* ci(mutmut): enable mutate_only_covered_lines to fit in CI budget (#27910)

The mutation-test workflow timed out at the 350-minute job cap when
running whole-folder mutation against litellm/proxy/management_endpoints/
(~30 files, ~1.5 MB of source). Every mutant was running the full
test suite, and mutants were generated for lines no test covers — which
would survive regardless, just wasting compute.

mutmut 3.x's mutate_only_covered_lines setting runs the suite once up
front to compute coverage, then skips mutating uncovered lines. This
cuts the mutant count dramatically and is the right semantic for the
score (no test → no kill possible → uncountable). Per-mutant test
filtering by function name is already automatic in mutmut 3.x; no
external coverage step is needed.

* fix(rate-limit): stop v3 limiter from leaking internal stash to provider body (#27913)

* fix(rate-limit): stop v3 limiter from leaking internal stash to provider body

PR #27001 (atomic TPM rate limit) introduced a reservation flow that
writes four LiteLLM-internal keys onto the request data dict:

  _litellm_rate_limit_descriptors
  _litellm_tpm_reserved_tokens
  _litellm_tpm_reserved_model
  _litellm_tpm_reserved_scopes
  _litellm_tpm_reservation_released

These keys are forwarded as request body params to the upstream provider,
which rejects them as unknown fields:

  OpenAI    -> 400 'Unknown parameter: _litellm_rate_limit_descriptors'
              (mapped by litellm to RateLimitError / 429, hiding the bug
               behind a misleading 'throttling_error' code)
  Anthropic -> 400 '_litellm_rate_limit_descriptors: Extra inputs are
               not permitted'

Net effect: every chat completion against any real provider fails the
moment a virtual key has any tpm_limit / rpm_limit set — i.e. v3-enforced
key-level TPM/RPM limits are broken end-to-end. The v3 RPM/TPM check
itself still runs (raises 429 on over-limit), but the success path
poisons the upstream body.

Reproduced on litellm_internal_staging HEAD (410ce761dc) against
gpt-4o-mini and claude-haiku-4-5 with a 1-RPM/1-TPM key — first request
fails with the provider's unknown-field error.

Fix: the stash is metadata only.

  - Add RATE_LIMIT_DESCRIPTORS_KEY constant and a _LITELLM_STASH_KEYS
    registry so we have a single source of truth for stash keys.
  - New helper _stash_value_in_metadata_channels writes to
    data['metadata'] / data['litellm_metadata'] without touching the
    top level.
  - _stash_reservation_in_data and the descriptor stash now route
    through that helper. _mark_reservation_released stops writing
    top-level.
  - _lookup_stashed_value also checks kwargs['metadata'] /
    kwargs['litellm_metadata'] (raw request_data shape) in addition to
    kwargs['litellm_params']['metadata'] (completion kwargs shape).
  - async_post_call_failure_hook now reads descriptors via the unified
    metadata lookup instead of request_data.get(top-level).
  - Defense in depth: async_pre_call_hook strips any stash key that
    somehow surfaced at the top level (stale cache, future refactor,
    test fixture) before returning.

Tests:
  - New regression test asserts no _litellm_* stash key is present at
    the top level of data after async_pre_call_hook, and that the
    metadata channel still carries the reservation + descriptors so
    success / failure reconciliation works.
  - Existing test_tpm_concurrent.py tests that asserted top-level
    presence are updated to read from data['metadata'] — the location
    is an implementation detail; the spec is that post-call callbacks
    can resolve the stash.

Verified end-to-end against OpenAI gpt-4o-mini and Anthropic
claude-haiku-4-5 via /v1/chat/completions on a low-rpm key:

  - With limits not exceeded: HTTP 200, valid completion response,
    no leaked fields in body.
  - With RPM exceeded: HTTP 429 from v3 enforcement
    ('Rate limit exceeded ... Limit type: requests').
  - With TPM exceeded: HTTP 429 from v3 enforcement
    ('Rate limit exceeded ... Limit type: tokens').

Full v3 hook test suite passes (171 tests).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* chore(rate-limit): use RATE_LIMIT_DESCRIPTORS_KEY constant in test, trim noisy comments

Address greptile P2: test fixture now uses the imported constant.
Drop comments that re-explain what well-named identifiers already convey.

* fix(rate-limit): reject caller-supplied stash values to prevent TPM-refund abuse

Strip _LITELLM_STASH_KEYS from data top-level and both metadata channels at
the start of async_pre_call_hook. Without this, an authenticated caller can
inject _litellm_rate_limit_descriptors plus _litellm_tpm_reserved_tokens in
body metadata, trigger a proxy-side rejection, and cause
async_post_call_failure_hook to refund TPM counters against attacker-named
scopes (e.g. another tenant's api_key).

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix: allow for allowlisted redirect URIs (#27761)

* fix: allow for allowlisted redirect URIs

* github comment addressing

* Update litellm/proxy/_experimental/mcp_server/oauth_utils.py

Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com>

* harden oauth wildcard further

* test: cover wildcard entry with dot-leading suffix rejection

---------

Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com>

* Emit native web_search_tool_result blocks for Anthropic clients (Claude Desktop / Cowork citations) (#27886)

* feat(custom_logger): add async_post_agentic_loop_response_hook

Lets a CustomLogger shape the response returned by the agentic-loop
follow-up call without bypassing the loop's safety / observability
machinery (depth tracking, fingerprinting, etc.). Default returns the
response unchanged.

Used by websearch_interception to inject Anthropic-native
web_search_tool_result blocks when the originating client requested a
native web_search_* tool.

* feat(llm_http_handler): call post-agentic-loop hook on the originating callback

In _execute_anthropic_agentic_plan, after anthropic_messages.acreate
returns, call the originating callback's
async_post_agentic_loop_response_hook so it can mutate the final
response (e.g. inject native tool_result blocks). Pass the callback
through from _call_agentic_completion_hooks.

Exceptions in the post-hook are caught and logged so a buggy callback
can't kill the request.

* feat(websearch_interception): add is_anthropic_native_web_search_tool

Identifies tools the Anthropic-native clients (Claude Desktop, the
Anthropic SDK, the Anthropic Console) use to request native search:
type starts with "web_search_" (e.g. web_search_20250305). Rejects the
LiteLLM standard tool, the OpenAI-function variant, the bare
"WebSearch" legacy name, and the bare "web_search" Claude Code shape.

This lets us decide per-request whether the client expects
web_search_tool_result content blocks in the response, without
renaming any existing constants or touching native-provider skip
logic.

* feat(websearch_interception): add build_web_search_tool_result_block

Produces the Anthropic-native web_search_tool_result content block
from a structured SearchResponse. Anthropic-native clients use this
block to populate citations / source links — the existing text-blob
flatten path only feeds readable evidence to the model and discards
the structure, so this builder gives us the missing piece.

Shape matches https://docs.anthropic.com/en/api/web-search-tool —
web_search_result items carry url, title, page_age, encrypted_content
(empty string when the search provider doesn't supply one).

* feat(websearch_interception): emit native web_search_tool_result blocks

When the originating client request carried a native Anthropic
web_search_* tool, the final response now also carries
web_search_tool_result content blocks alongside the model's text
answer — so Claude Desktop / Anthropic SDK clients can populate the
citations panel and replay conversation history with structured search
evidence.

Wiring:
- Pre-request hooks (both deployment + Anthropic path) set a flag on
  kwargs when they see a native web_search_* tool, so the signal
  survives the conversion-to-litellm_web_search step regardless of
  which hook fires first.
- _execute_search now returns (text, SearchResponse) so the structured
  results aren't lost when the text is flattened for the follow-up
  model call.
- _build_anthropic_request_patch returns the parallel list of
  SearchResponse objects.
- async_build_agentic_loop_plan pre-builds the web_search_tool_result
  blocks (one per tool_use_id) and stashes them on plan.metadata when
  the flag is set.
- async_post_agentic_loop_response_hook reads the metadata and
  prepends the blocks to response.content.
- _execute_agentic_loop mirrors the injection for the legacy path so
  both paths behave identically.

Clients that send the LiteLLM standard tool keep the existing
text-only behavior — no regression.

* test(websearch_interception): cover native web_search_tool_result emission

18 tests across:
- detector branches (native vs litellm-standard, OpenAI-function shape,
  Claude Desktop builtin WebSearch, bare web_search, missing type)
- block-builder shape (results, none, empty)
- pre-request hook flag-setting (native sets, standard does not)
- async_build_agentic_loop_plan attaches blocks to plan.metadata when
  the flag is present, leaves metadata untouched when absent
- post-hook injection into dict and object responses
- legacy _execute_agentic_loop mirrors the injection so both paths
  return the same shape

* test(websearch_short_circuit): keep _execute_search mocks in sync with new tuple return

* test(websearch_thinking_constraint): keep _execute_search mocks in sync with new tuple return

* feat(websearch_interception): emit native blocks from try_short_circuit_search

The agentic-loop post-hook only fires when the model returns a tool_use
block. Cowork / Claude Desktop on Bedrock actually make TWO requests
per user turn: the main /v1/messages with their builtin tool, and a
separate standalone /v1/messages whose only tool is
web_search_20250305. That second request hits try_short_circuit_search
— no agentic loop, no post-hook — and was returning text-only, leaving
the citations panel empty.

When the short-circuit input carries a native web_search_* tool, build
a synthetic server_tool_use + web_search_tool_result pair (using the
structured SearchResponse already returned by _execute_search) so the
client gets the native shape it expects. The legacy text block is
preserved so non-native short-circuit callers (Claude Code,
github_copilot, etc.) see the same payload as before.

Failure path still emits the native block pair (with empty results)
plus the text-error block, so the client gets a well-formed response
rather than a malformed half-shape.

* test(websearch_native_blocks): cover short-circuit native-block emission

Three new cases on top of the existing 18:
- native web_search_20250305 short-circuit → [server_tool_use,
  web_search_tool_result, text], ids paired, urls/titles carried.
- litellm_web_search short-circuit → text-only (no regression).
- native short-circuit on search failure → still emits the native
  block pair (empty results) plus the text-error block, so the client
  never sees a malformed half-shape.

* test(websearch_short_circuit): index assertions by block type, not by position

Native short-circuit responses now have [server_tool_use,
web_search_tool_result, text] when the input carries
web_search_20250305 — find the text block by type rather than relying
on content[0].

* fix(websearch_interception): gate legacy WebSearch name on schema absence

Clients like Cowork / Claude Desktop ship a client-side tool named
"WebSearch" with a full input_schema — they handle it themselves and
expect to make a separate native web_search_20250305 sub-request for
the actual search.

Today is_web_search_tool matches the bare name regardless of other
fields, which hijacks the client's tool server-side. The agentic loop
fires on the main request, the model never gets to emit the
client-side tool_use, and the separate native sub-request (where
citation data flows) is never made. Net: citations panel empty.

Real Anthropic client tools always carry input_schema (the API rejects
them otherwise), so a bare {name: "WebSearch"} with no schema is the
only thing that could be a legacy interception marker. Gate the match
on schema absence: legacy callers (if any) keep working, real
client-side WebSearch tools pass through untouched.

* fix(websearch_interception): drop "WebSearch" from response-detection lists

Post-conversion the model always sees ``litellm_web_search``, so the
"WebSearch" entry in the response-side tool_use detection lists was
dead at best. If a model ever did return ``tool_use(name="WebSearch")``
it would now (incorrectly) hijack the client's own ``WebSearch`` tool
again — same Cowork problem we just fixed on the input side. Drop it.

* test(websearch_native_blocks): cover the WebSearch legacy-name schema gate

Three new cases:
- {name: "WebSearch"} (bare interception marker) → still matched
- {name: "WebSearch", input_schema: {...}} (Cowork client tool) →
  passes through untouched
- {name: "WebSearch", description: "..."} (no schema) → still matched
  on the assumption it's a legacy marker rather than a malformed real
  client tool.

---------

Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com>

* ci(codecov): restore litellm/ prefix on uploaded coverage paths

pytest-cov runs with --cov=litellm, which makes coverage.xml store paths
relative to the package root (e.g. `proxy/proxy_server.py` instead of
`litellm/proxy/proxy_server.py`). Codecov auto-resolves these only when
the basename is unique in the repo. Files like proxy_server.py, router.py,
utils.py, main.py, and constants.py — which have duplicates under
enterprise/ or other subpackages — get silently dropped during ingest.

The `fixes: ["::litellm/"]` rule prepends `litellm/` to every uploaded
path so they resolve unambiguously. Confirmed against multiple recent
coverage.xml artifacts that no uploader currently emits paths already
prefixed with `litellm/`, so the rule is safe to apply universally.

This restores Codecov visibility for the highest-fix-rate hotspots:
proxy_server.py, router.py, proxy/utils.py, litellm_logging.py,
constants.py, key_management_endpoints.py, utils.py, main.py,
user_api_key_auth.py, team_endpoints.py, and litellm_pre_call_utils.py.

* chore(ci): remove unused GitHub Actions workflows and orphan files

Audit of .github/workflows/ via gh run history shows the following have
either never run or have been dormant for 10+ weeks. CI coverage that
still matters is preserved on CircleCI (e.g. llm_translation_testing).

Removed workflows:
- test-litellm.yml — workflow_dispatch only, last run 2026-02-12 (cancelled);
  CCI local_testing_part1/2 covers the same tests
- llm-translation-testing.yml — last run 2025-07-10; replaced by CCI
  llm_translation_testing job (run_llm_translation_tests.py kept for the
  make test-llm-translation target)
- run_observatory_tests.yml — last run 2026-03-03 (cancelled)
- scan_duplicate_issues.yml — last run 2026-03-02 (failure)
- publish_to_pypi.yml — never run
- read_pyproject_version.yml — fires on every push to main but its echoed
  version output is not consumed by any downstream step

Removed orphan files (no callers in workflows, CCI, or Makefile):
- .github/workflows/README.md — documented only publish_to_pypi.yml
- .github/workflows/update_release.py + results_stats.csv
- .github/actions/helm-oci-chart-releaser/

* Revert "ci(codecov): restore litellm/ prefix on uploaded coverage paths"

This reverts commit e25a988a3feb4a31843a67274a3a64fea2fed805.

The `fixes: ["::litellm/"]` rule turned out to be applied *after* Codecov's
auto-resolution, not before. Files with unique basenames (which were
auto-resolving correctly to `litellm/<path>`) got an extra `litellm/`
prepended, producing `litellm/litellm/<path>` storage. Files with
ambiguous basenames (the actual target of the fix) continued to be
dropped because the auto-resolution still failed for them.

Net result on the verification run: 1375 files now stored under
unresolvable `litellm/litellm/...` paths, and the 11 originally-missing
hotspots are still missing. Reverting before piling on further changes.

* test(ui): preserve global Button/Tooltip mocks in per-file @tremor/react vi.mock

Per-file `vi.mock("@tremor/react", ...)` factories fully replace the
setup-level mock from `tests/setupTests.ts`, so the global Button/Tooltip
overrides are lost in any file that re-mocks `@tremor/react`. Without
them, the real Tremor `<Button>` leaks through and its internal
`useTooltip(300)` schedules a native 300ms `setTimeout` on pointer
events. When the test environment is torn down before the timer fires,
the trailing `setState` calls `getCurrentEventPriority`, which reads
`window.event` against a destroyed jsdom -> "window is not defined"
flake observed on CI.

Patches the 7 leaky test files to re-supply `Button` (bare `<button>`)
and `Tooltip` (Fragment) overrides matching `setupTests.ts`. Also drops
a dead `afterEach` workaround in `user_edit_view.test.tsx` (the
fake-timer dance it ran could not drain a real timer scheduled before
the swap) and corrects a misleading comment in `MakeMCPPublicForm.test.tsx`.

* ci: use --cov=./litellm so coverage paths resolve unambiguously in Codecov

pytest-cov treats --cov=<module-name> as a Python package and emits XML
paths relative to the package root, stripping the litellm/ prefix
(`proxy/proxy_server.py` instead of `litellm/proxy/proxy_server.py`).
Codecov's auto-prefix heuristic then drops every file whose basename is
ambiguous in the repo — `proxy_server.py` (3 copies under enterprise/),
`router.py` (2 copies), `utils.py` (20+), `main.py` (20+), `constants.py`
(2). The 11 highest-fix-rate hotspots have never appeared in Codecov.

Switching to --cov=./litellm treats the argument as a path, which makes
coverage.xml emit repo-relative paths (`litellm/proxy/proxy_server.py`).
Each path is unambiguous, so Codecov resolves all files correctly.

Verified locally: rerunning a single proxy_unit_tests test with
--cov=./litellm produced `filename="litellm/proxy/proxy_server.py"`,
`filename="litellm/router.py"`, and `filename="litellm/types/router.py"`
as distinct entries — exactly the disambiguation Codecov needs.

Touches every workflow that uploads coverage: the two reusable GHA
workflows (_test-unit-base.yml, _test-unit-services-base.yml),
test-mcp.yml, and all 14 invocations in .circleci/config.yml.

* fix(mcp): allow delegate PKCE bypass for internal MCP servers

Remove available_on_public_internet gating from delegate-auth-to-upstream
paths so oauth2 + delegate_auth_to_upstream interactive servers behave
the same when marked internal. Keeps M2M exclusion. Updates tests.

* chore(mcp): warn on internal + upstream PKCE delegate

Log verbose_logger.warning when loading oauth2 interactive servers with
available_on_public_internet=false and delegate_auth_to_upstream=true
(config + DB). Dashboard Alert for the same combo. CLAUDE note for
operators. Tests for log and M2M skip.

* fix(mcp): dedupe load_servers_from_config alias block

Removes accidental duplicate alias/mcp_aliases and get_server_prefix
logic (fixes PLR0915 and avoids resetting alias after mapping).

* fix(mcp): expose delegate_auth_to_upstream in MCP server list rows (#27936)

_build_mcp_server_table omitted delegate_auth_to_upstream, so GET /v1/mcp/server always returned the default false while the registry kept the DB value.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(proxy): fix vector store retrieve/list/update/delete without model (#27929)

* feat(proxy): fix vector store retrieve/list/update/delete routing without model

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(proxy): remove unchecked query-param injection in vector store management endpoints

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(proxy): use subset assertion for vector store route test to allow extra kwargs like shared_session

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(managed_batches): convert raw output_file_id to managed ID in CheckBatchCost poller (#27984)

* fix(managed_batches): convert raw output_file_id to managed ID in CheckBatchCost poller

CheckBatchCost bypasses async_post_call_success_hook, causing raw provider
output_file_ids to be persisted in LiteLLM_ManagedObjectTable. This fix converts
output_file_id and error_file_id to managed base64 IDs before the DB write.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(check_batch_cost): persist managed file before mutating response and propagate team_id

- Move setattr after store_unified_file_id so the response only receives the
  managed ID once the DB record is successfully written. Avoids serializing
  an orphaned managed ID into file_object when the store call fails.
- Populate team_id on the minimal UserAPIKeyAuth from job.team_id so the
  managed file record is created with the correct team ownership, allowing
  other team members to access the batch output file via /files/{id}/content.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(managed_batches): extend test to cover error_file_id conversion

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix managed file test

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(vertex-ai): fix zero cost/usage on completed Vertex AI batch jobs (#27912)

* fix(vertex-ai): fix zero cost/usage on completed Vertex AI batch jobs

Vertex batch jobs recorded 0 spend and 0 tokens after PR #25627 added
automatic transformation of GCS predictions.jsonl to OpenAI format.

Two bugs fixed:

1. batch_utils.py: the Vertex-specific cost/usage reader
   (calculate_vertex_ai_batch_cost_and_usage) was always invoked and
   reads raw usageMetadata fields that no longer exist in the
   OpenAI-shaped output. Now the reader is only used when
   disable_vertex_batch_output_transformation=True; otherwise the
   generic path handles the already-transformed OpenAI-shaped content.

2. cost_calculator.py: batch_cost_calculator skipped the global
   litellm.get_model_info() lookup when a model_info dict was passed
   in, even when that dict had no pricing fields (e.g. deployment
   metadata with only id/db_model). It now falls back to the global
   pricing table when the provided model_info has no pricing data.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Update litellm/cost_calculator.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix(cost-calculator): use not-any guard for pricing fallback in batch_cost_calculator

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(cost-calculator): treat explicit zero batch pricing as set in model_info

The fallback to litellm.get_model_info() used truthy checks on pricing
fields, so 0.0 was treated as missing and replaced by global rates.
Use `is not None` like elsewhere in cost calculation. Add regression test.

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* feat: add weighted-routing failover (#27980)

* Feat: Add Weighted-Routing Failover

* test(router): cover weighted failover helper functions

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): align weighted failover deployment list type with mypy

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): address greptile review on weighted failover

- Narrow exception swallowing in `_maybe_run_weighted_failover` to
  `openai.APIError` so model failures defer to the regular fallback
  while programming bugs (AttributeError/KeyError/TypeError) surface.
- Note async-only limitation of `enable_weighted_failover` in the
  Router constructor docstring.
- Make the weighted distribution test less flaky (1000 iterations,
  looser bound) and make the non-simple-shuffle test deterministic by
  failing both deployments instead of relying on the latency strategy's
  first pick.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): ensure weighted failover metadata persists in kwargs

The previous `kwargs.setdefault(metadata_variable_name, {}) or {}` returned
a brand-new dict whenever the existing metadata was falsy (empty dict or
None), so writes to `_failover_excluded_ids` never made it back into
`kwargs`. Multi-hop weighted failover then re-selected previously failed
deployments and exhausted `max_fallbacks` prematurely.

Explicitly assign a fresh dict into kwargs when metadata is missing so
mutations are visible to subsequent failover hops.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(router): regression for weighted failover metadata persistence

Asserts kwargs["metadata"]["_failover_excluded_ids"] is populated after
_maybe_run_weighted_failover, proving the metadata dict written by the
helper is the same object that lives in kwargs (no disconnected copy).
Pairs with the prior fix that replaced `setdefault(..., {}) or {}` with
an explicit get/assign so writes survive across hops.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): harden weighted failover error/state handling

- Catch RouterRateLimitError (ValueError) alongside openai.APIError in
  _maybe_run_weighted_failover so an exhausted intra-group retry falls
  through to the regular cross-group fallback path instead of bubbling
  out and bypassing configured fallbacks.
- Stop mutating the shared input_kwargs dict; build a local copy with
  the weighted-failover keys so the entry (with _excluded_deployment_ids)
  cannot leak into later fallback paths reading the same dict.
- _get_excluded_filtered_deployments now returns an empty list when the
  exclusion filter removes every healthy deployment, instead of falling
  back to the original list. The original-list behavior risked re-picking
  the just-failed deployment; callers already handle the empty case by
  raising their no-deployments error, which weighted failover now catches
  and converts into a normal cross-group fallback.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(router): fall through to rpm/tpm when total weight is zero

When the weight metric's total is zero (e.g. after weighted-failover
exclusion leaves only zero-weight backups), continue to the next metric
(rpm/tpm) instead of returning a uniform random pick immediately. This
lets rpm/tpm still drive routing when present, and only falls back to
the uniform random pick at the end if no metric provides a positive
total weight.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(router): skip weighted failover when remaining deployments are all in cooldown

_maybe_run_weighted_failover was computing 'remaining' from all_deployments
(every deployment in the model group, including those in cooldown). This meant
that when all non-excluded deployments were in cooldown the method still invoked
run_async_fallback unnecessarily, which propagated into async_get_healthy_deployments,
found no eligible deployments, and raised RouterRateLimitError — only safely
caught thanks to the earlier exception-broadening fix.

The fix: before computing 'remaining', fetch the current cooldown set via
_async_get_cooldown_deployments and subtract it from all_ids. This allows
_maybe_run_weighted_failover to return None immediately (skipping the
run_async_fallback call entirely) when every non-failed deployment is in cooldown,
letting the caller fall through to the correct cross-group fallback path without
the wasteful extra round-trip.

Tests added:
- unit: _maybe_run_weighted_failover returns None without calling run_async_fallback
  when all remaining deployments are in cooldown
- unit: _maybe_run_weighted_failover still calls run_async_fallback when at least
  one healthy (non-cooldown) deployment is available
- integration: end-to-end fallthrough to cross-group fallback when remaining
  deployments are in cooldown

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix(bedrock-mantle): use /anthropic/v1/messages path for Mantle endpo… (#27976)

* fix(bedrock-mantle): use /anthropic/v1/messages path for Mantle endpoint (#27943)

* docs: add one-line docstring to _disable_debugging (#27894)

Squash-merged by litellm-agent from oss-agent-shin's PR.

* Add jp. Bedrock cross-region inference profile for claude-sonnet-4-6 (#27831)

Squash-merged by litellm-agent from Cyberfilo's PR.

* Sanitize empty text content blocks on /v1/messages (#27832)

Squash-merged by litellm-agent from Cyberfilo's PR.

* fix(bedrock-mantle): use /anthropic/v1/messages path for Mantle endpoint

The bedrock-mantle gateway (Claude Mythos Preview) serves the Anthropic
Messages API at /anthropic/v1/messages; /v1/messages returns 404 Not
Found. Both AmazonMantleConfig (chat/completions caller route) and
AmazonMantleMessagesConfig (anthropic-messages caller route) hardcoded
the wrong path, so every Mantle request 404'd before reaching the model.

Per the Anthropic docs: "[Claude in Amazon Bedrock] uses the Messages
API at /anthropic/v1/messages with SSE streaming."
https://platform.claude.com/docs/en/api/claude-on-amazon-bedrock

Confirmed independently against the live endpoint:
  /v1/chat/completions      -> 200 OK
  /v1/messages              -> 404 Not Found  (what litellm used)
  /anthropic/v1/messages    -> 200 OK         (Claude only)

Adds a regression test asserting both Mantle configs build the
/anthropic/v1/messages path, and updates the existing assertions that
encoded the wrong path.

---------

Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>

* fix: sanitize empty text blocks in sync anthropic_messages_handler path

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: João Costa <13508071+jpv-costa@users.noreply.github.com>
Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(utils): import get_secret at runtime (#28014)

* fix(proxy): make /config/update env-var encryption idempotent

A single decrypt-then-encrypt chokepoint (_encrypt_env_variables_for_db)
now backs both update_config and save_config. Re-submitting a value the
Admin UI read back from /get/config/callbacks as ciphertext no longer
stacks a second encryption layer, which previously decrypted to garbage
and silently broke the callback. The chokepoint decrypts with the pure
_decrypt_db_variables (no os.environ mutation on the write path) and
encrypts exactly once; update_config merges only the sent keys so
untouched env vars keep their stored ciphertext byte-for-byte.

* test(proxy): add endpoint-level regression for /config/update double-encryption

Adds test_update_config_env_var_round_trip_not_double_encrypted, which
drives the real /config/update handler: first write plaintext, then
re-POST the stored ciphertext (the Admin UI round-trip) and assert the
value is not stacked with a second encryption layer and untouched keys
stay byte-identical. Verified to fail against the pre-fix handler and
pass after. Also tightens the unit test to exactly three ciphertext
re-feeds.

* chore(ci): modernize model references in tests and configs (#27856)

* test: modernize models used in CircleCI e2e test suites

Replaces obsolete models (gpt-4o, gpt-4o-mini, gpt-3.5-turbo,
claude-3-5-sonnet-20240620, claude-sonnet-4-20250514) with current
equivalents across the e2e_openai_endpoints and
proxy_e2e_anthropic_messages_tests CircleCI jobs.

- gpt-4o -> gpt-5.5 (responses API e2e tests)
- gpt-4o-mini -> gpt-5-mini (websocket responses, oai_misc_config)
- gpt-4o-mini-2024-07-18 -> gpt-4.1-mini-2025-04-14 (fine-tuning,
  still actively fine-tunable)
- gpt-4 / gpt-3.5-turbo target_model_names example -> gpt-5.5 /
  gpt-5-mini
- bedrock claude-3-5-sonnet-20240620 batch entry -> haiku-4-5-20251001
  (also aligning oai_misc_config model_name with what
  test_bedrock_batches_api.py actually requests)
- bedrock claude-sonnet-4-20250514 (deprecated, retires 2026-06-15)
  -> claude-sonnet-4-5-20250929

* test: point bedrock-claude-sonnet-4 alias at Sonnet 4.6, not 4.5

Greptile/Cursor flagged that after the previous commit, the
bedrock-claude-sonnet-4 alias collided with bedrock-claude-sonnet-4.5
(both pointed to claude-sonnet-4-5-20250929). Rename to
bedrock-claude-sonnet-4.6 and point it at the Sonnet 4.6 Bedrock ID
(us.anthropic.claude-sonnet-4-6, already in the litellm model
registry) so the alias name matches the underlying model version.

* test: modernize models across remaining CI-mounted configs & tests

Expands the modernization sweep to all CircleCI-mounted proxy configs
and to test directories where the model literal is a fixture/route key
(not the test's subject).

Config changes:
- proxy_server_config.yaml: bump gpt-3.5-turbo / gpt-3.5-turbo-1106 /
  gpt-4o / gemini-1.5-flash / dall-e-3 underlying models; rename
  gpt-3.5-turbo-end-user-test alias to gpt-5-mini-end-user-test; bump
  text-embedding-ada-002 underlying to text-embedding-3-small. User-
  facing aliases (gpt-3.5-turbo, gpt-4, text-embedding-ada-002, etc.)
  preserved for backward compatibility with tests.
- simple_config.yaml, otel_test_config.yaml, spend_tracking_config.yaml:
  bump gpt-3.5-turbo underlying to gpt-5-mini.
- pass_through_config.yaml: claude-3-5-sonnet / claude-3-7-sonnet /
  claude-3-haiku entries replaced with claude-sonnet-4-5 / claude-
  haiku-4-5 / claude-opus-4-7.
- oai_misc_config.yaml: align alias name with the gpt-5-mini rename.

Test changes (proactive: claude-sonnet-4-20250514 / claude-opus-4-
20250514 retire 2026-06-15):
- tests/llm_translation/test_anthropic_completion.py: bump 3 references
  + paired Vertex AI ID to claude-sonnet-4-5.
- tests/llm_translation/test_optional_params.py: bump 2 references.
- tests/pass_through_unit_tests/test_anthropic_messages_passthrough.py
  and test_bedrock_anthropic_messages_test.py: bump router fixtures
  using the deprecated model IDs.
- tests/pass_through_unit_tests/base_anthropic_messages_tool_search_test.py:
  modernize docstring examples.
- tests/test_end_users.py: update references to renamed alias.

* test: modernize placeholder model literals in router_unit_tests

Mass replace_all on fixture/placeholder model literals across the
router_unit_tests/ suite (model name is a routing key / label, not the
test subject). Sub-agent sweep so far — additional commits will follow
for logging_callback_tests/, enterprise/, top-level tests/test_*.py,
and other CI-mounted dirs.

Mappings applied:
- gpt-3.5-turbo -> gpt-5-mini
- gpt-4 (bare) -> gpt-5.5
- gpt-4o (bare) -> gpt-5
- text-embedding-ada-002 -> text-embedding-3-small
- claude-3-sonnet-20240229 / claude-3-opus-20240229 /
  claude-3-haiku-20240307 / claude-3-5-sonnet-20240620 ->
  claude-sonnet-4-5-20250929 / claude-opus-4-7 /
  claude-haiku-4-5-20251001 as appropriate

Explicitly preserved:
- gpt-4o-mini-* variants (transcribe, tts, etc.) where they're current
- gpt-4-turbo / gpt-4-vision-preview / gpt-4-0613 (subject literals)
- JSONL batch body literals
- Mock LLM response model fields (must match upstream)
- Fake/mock identifiers

* test: modernize placeholder model literals across remaining CI suites

Sub-agent sweep across logging_callback_tests/, guardrails_tests/,
enterprise/, pass_through_unit_tests/, otel_tests/,
llm_responses_api_testing/, batches_tests/, spend_tracking_tests/,
litellm_utils_tests/, unified_google_tests/, and a few top-level
tests/test_*.py files where the model literal is a fixture or
placeholder (router model_list, mock standard logging payload, mock
callback data) rather than the test's subject.

Mappings applied (see scope notes below):
- gpt-3.5-turbo -> gpt-5-mini
- gpt-4 (bare) -> gpt-5.5
- gpt-4o (bare) -> gpt-5.5 (corrected from initial gpt-5 — bare gpt-5
  is not a valid OpenAI alias; only gpt-5.5 / gpt-5.4 / gpt-5.2-codex
  / gpt-5-mini exist)
- gpt-4o-mini (bare) -> gpt-5-mini
- text-embedding-ada-002 -> text-embedding-3-small
- claude-3-sonnet-20240229 -> claude-sonnet-4-5-20250929
- claude-3-opus-20240229 -> claude-opus-4-7
- claude-3-haiku-20240307 -> claude-haiku-4-5-20251001
- claude-3-5-sonnet-20240620/20241022 -> claude-sonnet-4-5-20250929
- claude-3-7-sonnet-20250219 -> claude-sonnet-4-6
- gemini-1.5-flash -> gemini-2.5-flash
- gemini-1.5-pro -> gemini-2.5-pro

Explicitly preserved (not modernized):
- llm_translation/ tests where model is the SUBJECT (provider-specific
  translation/transformation logic). Only the deprecated 20250514
  references were already bumped in a prior commit.
- Cost-calc / tokenizer subject tests in test_utils.py (skip-ranges
  documented by the sub-agent).
- Bedrock model IDs in test_health_check.py path-stripping tests.
- JSONL batch request bodies and mock LLM response bodies (must match
  upstream literal).
- Langfuse expected-request-body JSON fixtures (cost values are exact-
  match-asserted; changing the model would shift response_cost).
- gpt-3.5-turbo-instruct (text-completion endpoint; no modern OpenAI
  equivalent).
- Top-level tests calling the proxy through user-facing aliases
  (gpt-3.5-turbo, gpt-4, text-embedding-ada-002, dall-e-3) — aliases
  in proxy_server_config.yaml stay; only the underlying model was
  bumped.
- tests/test_gpt5_azure_temperature_support.py (the test's whole point
  is model-name handling).
- Fake / mock / openai/fake identifiers.

Notable side fixes:
- test_spend_accuracy_tests.py: UPSTREAM_MODEL now matches what
  spend_tracking_config.yaml's proxy actually routes to (gpt-5-mini),
  resolving a latent inconsistency.
- proxy_server_config.yaml: bare `gpt-5` alias renamed to `gpt-5.5`
  (bare gpt-5 is not a valid OpenAI alias).
- test_batches_logging_unit_tests.py: explicit_models list entries
  kept distinct (gpt-5-mini + gpt-5.5) after bulk rename.

* test: fix CI failures from model modernization sweep

CI surfaced 4 categories of regression from the bulk modernization:

1. Azure deployment names are customer-specific. Reverted:
   - tests/litellm_utils_tests/test_health_check.py: azure/text-
     embedding-3-small -> azure/text-embedding-ada-002 (the CI Azure
     account does not have a text-embedding-3-small deployment).
   - tests/logging_callback_tests/test_custom_callback_router.py:
     same revert for two router fixtures driving aembedding.

2. gpt-5 family does not accept temperature != 1. Tests that pass a
   custom temperature swapped from gpt-5-mini to gpt-4.1-mini (modern
   non-reasoning OpenAI mini that still accepts temperature/logprobs):
   - tests/logging_callback_tests/test_datadog.py
   - tests/logging_callback_tests/test_langsmith_unit_test.py
   - tests/logging_callback_tests/test_otel_logging.py

3. proxy_server_config.yaml's gpt-3.5-turbo-large alias was routing to
   gpt-5.5 (a reasoning model that rejects logprobs). The proxy test
   tests/test_openai_endpoints.py::test_chat_completion_streaming
   exercises logprobs/top_logprobs through that alias. Bumped the
   underlying model to gpt-4.1 (non-reasoning, still modern).

4. tests/logging_callback_tests/test_gcs_pub_sub.py asserts against a
   pinned JSON fixture (gcs_pub_sub_body/spend_logs_payload.json) with
   hardcoded model="gpt-4o" and a model-specific spend value. Reverted
   the litellm.acompletion calls in the test to model="gpt-4o" so the
   fixture's exact-match assertions still hold.

5. tests/pass_through_unit_tests/test_anthropic_messages_passthrough.py:
   anthropic.messages.create routing to openai/gpt-5-mini returned an
   empty content[0] with max_tokens=100 (reasoning-token consumption).
   Swapped to openai/gpt-4.1-mini.

* test: fix Assistants API model + 2 cursor[bot] review nits

1. pass_through_unit_tests/test_custom_logger_passthrough.py: gpt-5.5
   isn't accepted by the /v1/assistants endpoint
   ("unsupported_model"). Switch to gpt-4.1-mini (modern, Assistants-
   API-supported, non-reasoning).

2. example_config_yaml/pass_through_config.yaml: the previous sweep
   bumped the claude-3-7-sonnet alias to claude-opus-4-7, which is a
   tier change (Sonnet -> Opus). Map to claude-sonnet-4-6 to keep the
   Sonnet tier intact. (Cursor bugbot review.)

3. example_config_yaml/simple_config.yaml: model_name was left as
   gpt-3.5-turbo while the underlying was bumped to gpt-5-mini, which
   muddles the "simple" example. Make both sides gpt-5-mini so the
   most basic example is a straight 1:1 mapping again. (Cursor bugbot
   review.)

* fix: revert gpt-4/gpt-3.5-turbo alias underlying to non-reasoning models

tests/test_openai_endpoints.py::test_completion calls the proxy alias
"gpt-4" with temperature=0, and other tests call gpt-3.5-turbo with
custom temperature / logprobs / the legacy /v1/completions endpoint.
The earlier modernization mapped both aliases to gpt-5.5 / gpt-5-mini,
which are reasoning models that reject temperature != 1 and don't
expose /v1/completions. Map the aliases to gpt-4.1 / gpt-4.1-mini
(modern non-reasoning OpenAI models) instead — keeps user-facing
aliases preserved while picking a current underlying that still
supports the parameters/endpoints the tests exercise.

* test(proxy): isolate run_server CLI tests from prisma DB-setup path

test_keepalive_timeout_flag and test_timeout_worker_healthcheck_flag
were the only run_server tests in test_proxy_cli.py that neither
stripped DATABASE_URL/DIRECT_URL nor mocked the prisma DB path. When a
DATABASE_URL is present (CI/env leak), run_server --local enters the DB
block and blocks in the un-timeout'd subprocess.run(["prisma"]) at
proxy_cli.py:987 plus the ProxyExtrasDBManager migrate-deploy retry
loops, ~370s per test on the CI runner. --dist=loadscope pins both to
one xdist worker, so the proxy-infra job appears stuck at 99% and hits
the 20-min timeout.

Apply the same isolation every other run_server test in this file
already uses: mock PrismaManager.setup_database +
should_update_prisma_schema and strip DATABASE_URL/DIRECT_URL. Full
module drops from 31.7s to 2.9s locally; both tests fall off the slow
list.

* feat: add OTEL GenAI latest-experimental semantic convention support (#27418)

- Introduce `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental` opt-in that switches OTEL traces to conform with the OpenTelemetry GenAI semantic conventions specification
- Extract all semconv behavior into a new `OTELGenAISemconvMixin` class in `gen_ai_semconv.py`, mixed into `OpenTelemetry` to keep concerns separated
- In semconv mode, span name follows `{operation} {model}` pattern (e.g. `chat gpt-4`) and span kind is set to `CLIENT` instead of legacy `litellm_request`
- Replace `gen_ai.system` with `gen_ai.provider.name` and drop `llm.is_streaming` in semconv mode; add `gen_ai.request.{frequency_penalty,presence_penalty,top_k,seed,stop_sequences,stream,choice.count}` and `gen_ai.usage.cache_{creation,read}.input_tokens` attributes
- Replace per-message `gen_ai.content.prompt` / per-choice `gen_ai.content.completion` log events with a single consolidated `gen_ai.client.inference.operation.details` event; omit `gen_ai.input/output.messages` when content capture is disabled
- Suppress the non-standard `raw_gen_ai_request` child span entirely in semconv mode
- Support both programmatic (`OpenTelemetryConfig.semconv_stability_opt_in` field) and environment variable activation; the two sources are unioned so either or both can enable the opt-in
- Extract OTEL SDK `LogRecord` / `SeverityNumber` version-compatibility shim into a reusable `_otel_log_types()` static method to deduplicate the `< 1.39.0` / `>= 1.39.0` import branching
- Add 30+ unit tests covering opt-in gating, span naming, attribute emission/omission rules, stop sequence normalization, cache token attributes, and the consolidated event lifecycle

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>

* chore: retrigger CI

* test(ci): add reasoning_effort grid v4 e2e regression suite

Encode the 231-cell QA sweep (21 provider x model combos x 11 effort
values) from #27039 / #27074 as an automated CircleCI-gated regression
suite. Each cell hits the real provider endpoint, captures the outgoing
wire body via a pre-call CustomLogger, and asserts:

- thinking.type, output_config.effort, thinking.budget_tokens, max_tokens
  in the captured request body (regression signal for silent drops/strips
  in any provider transformation)
- HTTP status (200 vs BadRequestError -> 400) returned by litellm
  (regression signal for clean-error vs leaked-500 mappings)

The matrix is encoded as a small rule set keyed by (model_mode, effort)
plus per-model xhigh/max capability overrides, then expanded across the
five chat-completion routes (Anthropic direct, Azure AI Foundry, Vertex
AI, Bedrock Converse, Bedrock Invoke /chat) and the Bedrock Invoke
/v1/messages route. Cells skip at runtime when the route's provider env
vars are absent, so PR builds without credentials no-op gracefully.

Wired into CircleCI as the reasoning_effort_grid_v4_e2e job behind the
existing main / litellm_* branch filter.

* fix(reasoning_effort_grid_v4): cleanup unused fixture, parse converse body, guard budget tokens

- Remove unused vertex_credentials_path fixture (and now-unused os import)
  from conftest.py.
- Parse Bedrock Converse complete_input_dict (logged as a JSON string by
  converse_handler.py) before passing to _assert_cell, so dict accessors
  work uniformly across routes.
- Extend _BUDGET_TOKENS with xhigh and max entries so the budget-mode
  branch in expected() cannot KeyError if a future budget model gains
  the matching cap.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(reasoning_effort_grid_v4): grant sonnet-4-6 entries the max-effort cap

The runtime _validate_effort_for_model allows effort='max' for any
Claude 4.6 model (opus or sonnet), and model_prices_and_context_window
sets supports_max_reasoning_effort: true for claude-sonnet-4-6. The
grid spec previously gave sonnet-4-6 entries _CAPS_NONE, so expected()
returned status=400 for effort='max', which mismatched the runtime's
status=200 and caused 6 cells (one per route) to fail.

Rename _CAPS_OPUS_4_6 to _CAPS_4_6 (since the cap set is shared by
opus and sonnet 4.6) and assign it to all sonnet-4-6 entries.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* refactor(tests): move reasoning_effort grid suite under llm_translation, drop v4 naming

- Drop the "v4" suffix throughout: it referred to the QA sweep iteration,
  not this test suite. There's only one regression suite, so just call it
  reasoning_effort_grid.
- Move tests/test…
mateo-berri added a commit that referenced this pull request May 22, 2026
* feat(guardrails): add Microsoft Purview DLP guardrail

* fix(guardrails/purview): raise_for_status on HTTP errors, cap scope cache, reuse executor

* fix(guardrails/purview): propagate litellm_call_id as correlation_id to Purview

* chore: fixes

* refactor(guardrails): delegate get_user_prompt to get_last_user_message

PurviewGuardrailBase duplicated AzureGuardrailBase (and OpenAIGuardrailBase)
user-prompt extraction. The same logic already lived in
common_utils.get_last_user_message; wire guardrail bases to that helper,
fix the helper docstring, and drop its redundant self-import of
convert_content_list_to_str.

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix(purview): make protection scope cache true LRU on hits

OrderedDict.get() does not update insertion order; call move_to_end on
TTL-valid cache hits so popitem(last=False) evicts least-recently-used
users instead of FIFO by first insert.

Add a regression test with a small max cache size.

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* Fix mypy

* fix(guardrails/purview): harden user-id resolution and broaden DLP text

Prefer API key and proxy-injected metadata over client metadata for Entra
identity. Scan full message transcript pre-call and all completion choices
post-call. Align logging-only hook with the same user-id rules.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(guardrails/purview): scan /v1/completions prompt and TextChoices

Normalize text-completion prompts (string or list of strings); skip token-id-only
prompts. Run post-call DLP on TextCompletionResponse choices. Extend logging_only
hook for text_completion. Add tests and completion_prompt_to_str helper.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(purview-dlp): return data after DLP pass; per-call executor; dedupe text extraction

async_pre_call_hook now returns the request dict after a successful check so
callers match skip-path behavior. logging_hook uses a fresh ThreadPoolExecutor
per invocation like Presidio to avoid single-worker starvation. Response text
extraction is centralized in _completion_response_text_parts.

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix(purview): fix LRU cache refresh position and add Responses API scanning

Two fixes to the Microsoft Purview DLP guardrail:

1. LRU cache bug (base.py): When a stale scope cache entry was re-fetched,
   the assignment  updated the value but
   Python's OrderedDict.__setitem__ preserves the original insertion order for
   existing keys. This left the refreshed entry near the front of the dict,
   making it the first candidate for LRU eviction via popitem(last=False).
   Fix: call move_to_end(user_id) after every write to an existing key.

2. Responses API coverage gap (purview_dlp.py): Requests to /v1/responses use
   an 'input' field instead of 'messages' or 'prompt', so the pre-call hook
   returned without scanning the content. Similarly, post-call hook did not
   handle ResponsesAPIResponse.output. Fix: add _responses_api_input_to_str()
   helper and handle 'responses'/'aresponses' call types in async_pre_call_hook,
   async_post_call_success_hook (via _completion_response_text_parts), and
   async_logging_hook.

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix(purview): message separator, non-blocking logging_hook, TextChoices type error

Three bugs fixed in the Microsoft Purview DLP guardrail:

1. get_prompt_text_for_dlp message separator (base.py)
   - Previously called get_str_from_messages() which concatenated all message
     texts with NO separator, so 'end of msg1' + 'start of msg2' became
     'end of msg1start of msg2'.
   - Now joins per-message text with '\n\n' via convert_content_list_to_str(),
     preserving DLP pattern detection accuracy across message boundaries.

2. logging_hook blocking the event loop thread (purview_dlp.py)
   - Previously called future.result() which blocked the calling thread
     (often the event loop thread) for the entire round-trip of two sequential
     Microsoft Graph API calls (_compute_protection_scopes + _process_content).
   - Now fires and forgets: when called inside a running loop, schedules the
     coroutine with loop.create_task(); otherwise spawns a daemon thread.
     Returns (kwargs, result) immediately in both cases.
   - Removes unused concurrent.futures.ThreadPoolExecutor import; adds threading.

3. Incompatible assignment type error (purview_dlp.py:180)
   - mypy inferred 'choice' as TextChoices from the first loop body, then
     flagged the assignment in the second loop as incompatible with Choices.
   - Fixed by using distinct loop variable names: text_choice (TextChoices) and
     chat_choice (Choices).

Tests: 7 new tests added covering the separator fix (TestGetPromptTextForDlp)
and the non-blocking logging_hook (TestLoggingHookNonBlocking).

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix(purview): suppress API errors in logging-only mode and scan tool-call arguments

Three issues fixed:

1. _check_content except block re-raised unconditionally even when
   block_on_violation=False. The docstring promised 'log only - do not
   raise' but network/API errors always propagated. Fixed by checking
   block_on_violation before re-raising; when False, log a warning and
   continue.

2. async_logging_hook used a single try/except wrapping both the prompt
   and response audit calls. When the first _check_content (uploadText)
   raised due to an API error the second call (downloadText) was silently
   skipped. Fixed by giving each audit call its own try/except so both
   always run independently.

3. convert_content_list_to_str() only reads message.content, so
   tool_calls[].function.arguments and function_call.arguments were
   invisible to the Purview pre-call and post-call scans. An authenticated
   caller could embed sensitive text in tool-call arguments and bypass DLP.
   Fixed by:
   - Adding PurviewGuardrailBase._extract_tool_call_args_from_message()
     which handles both dict and object-style messages, covering both
     tool_calls[] arrays and the legacy function_call field.
   - Updating get_prompt_text_for_dlp() to include those arguments
     alongside message content (request/prompt path).
   - Changing _completion_response_text_parts() from @staticmethod to an
     instance method and adding tool-call argument extraction for
     ModelResponse choices (response path).

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* chore(ui): restructure pre-built Next.js output to directory-based routing

Flat page files (e.g. guardrails.html) replaced by directory-based
index.html equivalents (e.g. guardrails/index.html) matching the
Next.js App Router output format.

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix(purview): comprehensive security hardening — identity spoofing, streaming bypass, token-id gap

Four security issues addressed:

1. end_user_id kwargs fallback missing in _resolve_user_id_from_logging_kwargs
   user_id already fell back to kwargs.get("user_api_key_user_id") when absent
   from metadata, but end_user_id only checked md.get("user_api_key_end_user_id")
   with no kwargs-level fallback. Added or kwargs.get("user_api_key_end_user_id").

2. Streaming responses bypassed post_call blocking
   async_post_call_success_hook only runs on assembled non-streaming responses.
   For streaming requests the proxy already delivered all content before the
   hook ran, so raising HTTPException there had no effect. Added
   async_post_call_streaming_iterator_hook which buffers the entire stream,
   assembles it via stream_chunk_builder, runs the Purview DLP check, and only
   then re-yields chunks via MockResponseIterator. If a violation is detected the
   exception is raised before any bytes reach the client. The proxy automatically
   skips async_post_call_success_hook for guardrails that define this method,
   preventing duplicate scans.

3. Caller-controlled Purview user identity in blocking modes
   When a LiteLLM API key has no bound user_id the guardrail fell back to
   metadata[user_id_field], which is supplied by the caller. A caller could set
   this to any Entra object ID whose Purview policies are more permissive and
   bypass DLP. Added _resolve_trusted_user_id() that only returns identities
   from the proxy auth system (user_api_key_dict.user_id, end_user_id, or
   proxy-injected metadata["user_api_key_user_id"]). Added
   _resolve_user_id_for_blocking() used by all blocking-mode hooks: tries
   trusted sources first; if only caller-supplied is available, logs a
   SECURITY WARNING and still proceeds (backward compat); if nothing resolves,
   skips with a warning.

4. Token-id prompt DLP bypass
   When /v1/completions received a pure token-id array prompt,
   completion_prompt_to_str() returned None and the pre_call hook silently
   skipped the Purview scan. An authenticated caller could tokenize blocked
   text and send it without DLP evaluation. The hook now detects this case
   (raw_prompt present but prompt_text None) and logs a WARNING while letting
   the request pass through — token-id payloads are opaque at the text layer
   and cannot be scanned. This makes the gap explicit rather than silent.

Tests: 94 total, all passing.

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* Revert "chore(ui): restructure pre-built Next.js output to directory-based routing"

This reverts commit c70c4303b735bb3885732bd4a0e01997e9571f56.

* fix(purview): fail closed on identity spoofing, token prompts, and path encoding

Encode Entra user IDs in Graph paths, guard caches with asyncio.Lock, scan
Responses API instructions with string input, reject caller-only metadata and
token-id completion prompts in blocking mode, and revert unrelated UI HTML
restructure from the PR branch.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(purview): use threading.Lock and getattr for LitellmParams

- Replace asyncio.Lock with threading.Lock in PurviewGuardrailBase.
  The cache lock is acquired both from the proxy's main event loop and
  from short-lived event loops created by the logging_hook thread
  fallback. In Python 3.10+ an asyncio.Lock is bound to the first event
  loop that acquires it, so the second loop would silently break audit
  logging with RuntimeError. All critical sections are in-memory dict
  ops with no awaits, so a synchronous lock is safe.

- Use getattr() on LitellmParams in initialize_guardrail() instead of
  .get(), which does not exist on Pydantic BaseModel instances and
  would raise AttributeError at runtime. Tests updated to construct
  Mock objects with spec= so they reflect the real interface.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* refactor(purview): dedupe trust-level user resolution and drop dead code

- _resolve_user_id now delegates levels 1-3 to _resolve_trusted_user_id
  so blocking and non-blocking paths share a single source of truth.
- Drop redundant event_hook override in MicrosoftPurviewDLPGuardrail.__init__
  (initialize_guardrail already forwards event_hook=litellm_params.mode).
- Drop unused self._logging_only attribute; blocking is controlled by the
  block_on_violation argument passed to _check_content.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(purview): fail-closed on responses API transform error; avoid duplicate audit calls

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(purview): fail-closed blocking DLP; revert directory-based UI HTML

Blocking hooks now require UserAPIKeyAuth user_id/end_user_id only (no
spoofable metadata), re-raise Responses API transform errors, scan streamed
text completions, and reject requests with no bound identity. Reverts the
accidental directory-based Next.js output from cc47081 (c70c4303b7).

Co-authored-by: Cursor <cursoragent@cursor.com>

* Remove dead code in purview_dlp: _resolve_user_id_for_blocking never returns falsy

The method either returns a non-empty trusted user id or raises HTTPException,
so the 'if not user_id' guards in async_pre_call_hook and async_post_call_success_hook
were unreachable. Tighten the return type to str and drop the dead checks to
make the fail-closed behavior explicit.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(purview): exclude caller-controlled end_user_id from blocking DLP

Blocking Purview checks now use only API-key/JWT-bound user_id, not
end_user_id populated from request user/metadata/safety_identifier.

Co-authored-by: Cursor <cursoragent@cursor.com>

* style(purview): apply Black formatting to base.py

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(purview): use post-await timestamp for cache TTL

Capture the timestamp after the network call completes when storing it
as the cache freshness marker, so the effective TTL reflects when the
response was actually received rather than when the request started.
Under high network latency the previous behavior shortened the
effective cache lifetime.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(purview_dlp): fail closed when stream_chunk_builder returns None

stream_chunk_builder can return None (e.g., when ChunkProcessor filters
all chunks), causing both isinstance checks to fail and the buffered
chunks to be released without DLP scanning. Explicitly fail closed in
that case by raising an HTTPException so the streaming DLP guardrail
does not bypass policy enforcement.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(purview_dlp): resolve user_id before buffering stream

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* merge main (#28629)

* test(vcr): classify cache verdicts, detect live calls, surface cost leaks

Convert the per-test VCR verdict line from a single 'NOOP / HIT / MISS /
PARTIAL' tag into a classified outcome that distinguishes the cases that
silently bill the live API on every CI run from the ones that don't:

  HIT                         pure replay
  PARTIAL                     mixed replay + new recordings
  MISS:RECORDED               new cassette saved to Redis (cached next run)
  MISS:OVERFLOW               cassette > MAX_EPISODES_PER_CASSETTE; persister
                              refused to save; re-bills every run
  MISS:NOT_PERSISTED          test failed; save_cassette skipped; re-bills
  NOOP                        VCR-marked but no HTTP traffic (mocked elsewhere)
  UNMARKED:LIVE_CALL          test bypassed VCR AND opened a TCP connection
                              to a known LLM provider host -> wasted spend
  UNMARKED:NO_TRAFFIC         test bypassed VCR but didn't call out

The UNMARKED:LIVE_CALL signal is what converts 'this test probably hits
live' into 'this test connected to api.openai.com'. We install a
socket.connect / socket.create_connection wrapper for the duration of
each non-VCR-marked test and record any outbound TCP to a known LLM
provider hostname. The probe sits below the httpx layer so vcrpy and
respx (which both patch above the socket) are unaffected.

Replace the file-level _RESPX_CONFLICTING_FILES blacklists in the
llm_translation and local_testing conftests with per-item respx
detection in apply_vcr_auto_marker_to_items. A test now skips VCR when
it actually carries @pytest.mark.respx or has respx_mock in its fixture
chain - not just because some other test in the same file imports
MockRouter. Items skipped by skip_files are split into respx_conflict
(real conflict, the module wires up respx) vs file_opt_out (dead skip-
list entry whose module never touches respx) so the session summary
makes pruning obvious.

Stabilize the AWS SigV4 fingerprint: the Authorization header on
Bedrock requests rotates its Credential date and Signature on every
call, which previously pushed every Bedrock test past the 50-episode
overflow threshold. Extract the access-key id only
('aws-sigv4:AKIA...') so two requests with the same identity match.

Always emit verdict logging when VCR is active (set
LITELLM_VCR_VERBOSE=0 to opt back into the legacy quiet mode). Add a
session-end classification summary that lists overflow tests, unmarked
live-call tests, and the skip-reason breakdown.

Wire the live-call probe + summary hook into every test directory that
already uses the Redis-backed VCR cache (audio_tests, guardrails_tests,
image_gen_tests, litellm_utils_tests, llm_responses_api_testing,
llm_translation, local_testing, logging_callback_tests, ocr_tests,
pass_through_unit_tests, router_unit_tests, search_tests,
unified_google_tests).

Add tests/llm_translation/test_vcr_classification.py covering the
verdict classifier, skip-reason tagging, AWS SigV4 fingerprint stability,
live-host classification, and session summary rendering.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): drop dead 'from respx import MockRouter' imports

These seven test files were on _RESPX_CONFLICTING_FILES, which made the
auto-marker skip them entirely. Inspecting the source shows the only
respx artifact is a top-level 'from respx import MockRouter' that no
test ever uses - no @pytest.mark.respx, no respx_mock fixture, no
respx.mock context manager. The import is dead code left over from a
previous mocking pattern.

Now that apply_vcr_auto_marker_to_items detects respx per-item via the
marker / fixture chain (b637d9f64a), the file-level skip is no longer
needed for these files - they were the reason the OpenAI tests
(test_o3_reasoning_effort, test_streaming_response[o1/o3-mini],
TestOpenAIO1::test_streaming, TestOpenAIChatCompletion::test_web_search,
TestOpenAIO3::test_web_search, etc.) ran live every CI build despite
the cassette cache being healthy.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(image_edits): regenerate fixtures per call instead of holding open module-level file handles

Module-level

    TEST_IMAGES = [
        open(os.path.join(pwd, 'ishaan_github.png'), 'rb'),
        open(os.path.join(pwd, 'litellm_site.png'), 'rb'),
    ]
    SINGLE_TEST_IMAGE = open(...)

opens the file once at import. After the first multipart upload, the
file pointer is at EOF, so every subsequent test in the same xdist
worker sends an empty multipart body. That non-determinism (a) blows
the recorded cassette past MAX_EPISODES_PER_CASSETTE (50) so
_RedisPersister.save_cassette refuses to save it, and (b) re-bills the
live image edit endpoint on every CI run.

Recent CI runs confirm the leak: tests/image_gen_tests/test_image_edits.py
shows six tests parking at 51-52 cassette entries
(TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False],
TestOpenAIImageEditDallE2::..., test_openai_image_edit_with_bytesio,
test_openai_image_edit_litellm_router, test_multiple_vs_single_image_edit[False],
test_multiple_image_edit_with_different_formats).

Replace the module-level file handles with _make_test_images() /
_make_single_test_image() factories that return fresh _RewindableImage
(BytesIO subclass) objects whose pointer always starts at 0. The image
bytes are read once at import into module-level constants
(_ISHAAN_GITHUB_BYTES, _LITELLM_SITE_BYTES), so disk I/O cost is
unchanged.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(vcr): match real Bedrock hostnames in live-call probe

The suffix '.bedrock-runtime.amazonaws.com' never matched real Bedrock
endpoints, which use the format 'bedrock-runtime[-fips].{region}.amazonaws.com'
(region between 'bedrock-runtime' and 'amazonaws.com'). Add an explicit
host check for that pattern so Bedrock live calls are visible to the
probe, and update the unit test accordingly. Also drop the unused
'_LIVE_CALL_PROBE_INSTALLED' module variable.

* fix(vcr): cover full RFC1918 172.16.0.0/12 range in local prefixes

* fix(image_edits): drop _RewindableImage to prevent infinite multipart upload

The _RewindableImage(BytesIO) wrapper auto-rewound on every read after
EOF, which made the OpenAI SDK's multipart upload writer read the same
bytes forever instead of seeing EOF. Workers OOM'd / SIGKILL'd:

    [gw0] node down: Not properly terminated
    replacing crashed worker gw0
    ...
    worker 'gw1' crashed while running
        'tests/image_gen_tests/test_image_edits.py::TestOpenAIImageEditGPTImage1::test_openai_image_edit_litellm_sdk[False]'

The auto-rewind was added defensively for parametrized + flaky-retried
tests, but BaseLLMImageEditTest::test_openai_image_edit_litellm_sdk
already calls get_base_image_edit_call_args() once per invocation and
that helper now constructs fresh streams via _make_test_images(), so
rewinding inside the stream is unnecessary. Replace with plain BytesIO
seeded with the cached image bytes.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): mark Bedrock prompt-caching cross-call tests VCR-incompatible

The pass_through prompt-caching tests
(test_prompt_caching_returns_cache_read_tokens_on_second_call,
test_prompt_caching_streaming_second_call_returns_cache_read) make a
warm-up call and then assert the *second* call sees a non-zero
cache_read_input_tokens count from the upstream's prompt-cache. VCR
replay can't model cross-call provider state — both calls match the
same cassette episode, so the second call returns the first call's
pre-warmup response and the assertion fails:

    AssertionError: Expected cache_read_input_tokens > 0 on second call,
    but got 0. Full usage: {'input_tokens': 4986,
    'cache_creation_input_tokens': 4974, 'cache_read_input_tokens': 0}

This started biting after the AWS SigV4 fingerprint stabilization
(b637d9f64a): Bedrock requests now produce a stable per-access-key
fingerprint instead of a per-request signature, so cassettes
successfully replay where they previously always missed and re-recorded
live. Opt these tests out via skip_nodeid_suffixes so they run live and
match the existing pattern in tests/llm_translation/conftest.py
(::test_prompt_caching).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(vcr): tighten OVERFLOW classification and switch respx detection to AST

Address two greptile P2 review concerns on PR #27795:

1. MISS:OVERFLOW was firing whenever total > MAX_EPISODES_PER_CASSETTE
   regardless of cassette state. A cassette that grew past the cap
   historically but this run only *replayed* (dirty=False) is
   healthy — the persister never tries to save, so the cache state is
   stable and the next run will replay too. Only flag OVERFLOW when
   dirty=True (new episodes were recorded that the persister would
   refuse to save). Add a regression test covering the
   dirty=False + large-total case.

2. _module_uses_respx did substring matching on the module source,
   which false-positives on comments / docstrings / string literals.
   A comment like # Previously tried respx.mock but switched to
   vcrpy would keep a file pinned on the opt-out list, defeating the
   dead-import pruning goal of this PR. Replace the substring scan
   with an ast.NodeVisitor (_RespxUsageVisitor) that only
   counts:

     - @pytest.mark.respx / @respx.mock decorators
     - with respx.mock(): ... (sync + async) context managers
     - respx.mock(...) calls outside a with/decorator
     - function parameters / fixture names equal to respx_mock

   Add tests for the comment / docstring / string-literal cases plus
   each real-usage pattern.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(vcr): aggregate worker stats on the controller so the session summary actually renders under xdist

`_session_stats` is a module-level dict mutated inside `_vcr_outcome_gate`
— which runs in each xdist worker process. The controller's
`pytest_terminal_summary` then reads its own empty `_session_stats` and
bails on `if not counts: return`, so the OVERFLOW / LIVE_CALL sections
the rest of this PR adds never make it into CI logs in the dist mode CI
actually uses.

Ship a structured `vcr_outcome` payload via `user_properties` (which
xdist round-trips) and add `aggregate_report_outcome` on the controller
to fold worker outcomes into `_session_stats`. The recording process
tags `vcr_recorded_by` with `PYTEST_XDIST_WORKER` so the controller can
tell "single-process — already counted locally" apart from "produced by
a worker — needs aggregation here", and not double-count when there's
no xdist.

Covered by 9 new unit tests in test_vcr_classification.py including the
end-to-end summary render path.

* fix(guardrails): improve CrowdStrike AIDR input handling (#26658)

* feat(lasso): add tool-calling support to LassoGuardrail (#27648)

* feat(lasso): extend LassoGuardrail to support tool calling (RND-5748)

* fix(lasso): PR review followups for tool-calling guardrail (RND-5748)

* fix(lasso): handle object-style tool_calls in _update_tool_calls_from_masked (RND-5748)

* fix(lasso): use model role for tool_use blocks (RND-5748)

* test(lasso): add round-trip tests for message transformation (RND-5748)

* fix(lasso): remove unused imports, handle Responses-API input masking, flatten multimodal content (RND-5748)

* fix(lasso): inspect Responses-API input field (RND-5748)

* fix(lasso): guard text-cursor remap against Lasso count mismatch (RND-5748)

* fix(lasso): flatten list content in tool_result.content (RND-5748)

* fix(lasso): remap multimodal list content during masking (RND-5748)

Bug: _map_masked_messages_back counted list-content messages in
original_text_count but the remap loop only handled isinstance(str).
The positional text_cursor never advanced for list messages, causing
all subsequent masked texts to be written onto the wrong messages.

Fix: added elif isinstance(content, list) branch that replaces the
list with the masked text string and advances the cursor — mirrors
the existing string-content branch. Also handles the assistant +
tool_calls combo for list-content messages.

Test: test_map_masked_messages_back_list_content verifies a user
message with [text + image_url] followed by an assistant message
gets correct masked content on both (cursor stays aligned).

* refactor(lasso): extract _get_field and _extract_tool_call_fields helpers (RND-5748)

The dict-vs-object access pattern (x.get('y') if isinstance(x, dict)
else getattr(x, 'y', None)) was duplicated 14 times across 5 methods.

_get_field(obj, field) — single-point dict/Pydantic field access.
_extract_tool_call_fields(call) — returns (call_id, name, parsed_input)
with JSON argument parsing, replacing ~30 duplicate lines in both
async_post_call_success_hook and _expand_messages_for_classification.

Also simplified _update_tool_calls_from_masked, _prepare_payload tool
mapping, and _apply_masking_to_model_response call_id extraction.

Net ~60 lines removed. No behavior change — all 32 tests pass.

* fix(lasso): add count guard to _apply_masking_to_model_response (RND-5748)

_apply_masking_to_model_response used a bare text_cursor without
verifying 1:1 correspondence between text-bearing choices and masked
text entries. If Lasso returned a different number of text messages
than choices with content, masked text would be applied to the wrong
choice or silently skip choices.

Added the same count-mismatch guard pattern already used in
_map_masked_messages_back: count original text-bearing choices,
compare to masked_text length, skip text remap on mismatch with a
warning log. Tool_call masking via id-based lookup is unaffected.

Tests:
- test_apply_masking_to_model_response_multiple_choices: verifies
  correct per-choice masked text with 2 choices
- test_apply_masking_to_model_response_count_mismatch: verifies
  content is left unchanged when counts disagree

* fix(lasso): close two guardrail-bypass paths flagged in review (RND-5748)

* tool-call args: when function.arguments is malformed JSON or parses
  to a non-object, preserve the raw string as {"arguments": <raw>} so
  Lasso still inspects it instead of receiving input=None. Covers both
  pre-call and post-call extraction (shared helper). Also resolves the
  CodeQL empty-except warning since the except body now assigns parsed=None.
* Responses-API input: when a request carries both "messages" and
  "input", inspect both. Previously a benign messages array let the
  guardrail skip data["input"] entirely. The masking write-back is
  split via a count boundary so masked messages flow back to
  data["messages"] and masked input flows back to data["input"]
  without cross-contamination.

Tests: malformed/non-object args round-trip, dual-field classification,
dual-field masking write-back split.

* chore(lasso): black formatting + comment on expand skip branch (RND-5748)

* black: wrap two long expressions in lasso.py and reformat dict
  literals in test_lasso.py to satisfy CI lint.
* add a short comment in _expand_messages_for_classification
  explaining why empty string and None content are intentionally
  skipped (None is the OpenAI shape for a pure tool-call turn).

* fix(lasso): satisfy mypy in _handle_masking, _update_tool_calls_from_masked, _apply_masking_to_model_response (RND-5748)

* Narrow `response.get("messages")` into a local before slicing so
  mypy doesn't see `Optional[List[Dict[str, str]]]` as non-indexable.
* Rename the two write-side `func` bindings in
  `_update_tool_calls_from_masked` to `func_dict` / `func_obj` so
  mypy doesn't unify the dict and Any|None branches.
* Rename the inner loop variable in `_apply_masking_to_model_response`
  from `msg` to `masked_msg` to avoid clashing with the
  `msg = choice.message` rebinding below.

No behavior change; resolves the 7 mypy errors from the CI lint job.

* perf: eliminate per-request callback scanning on proxy hot path (#27858)

- Introduce `_CallbackCapabilities` dataclass and `ProxyLogging._callback_capabilities()` static method that inspects `litellm.callbacks` once and caches capability flags keyed on (list length, member ids); invalidates automatically when the callback list mutates without per-request iteration overhead
- Replace O(n) `litellm.callbacks` walks in `async_pre_call_hook`, `during_call_hook`, `async_post_call_streaming_iterator_hook`, `async_post_call_streaming_hook`, and `post_call_response_headers_hook` with fast-path exits when no relevant callbacks are registered
- Add `needs_iterator_wrap()` and `needs_per_chunk_streaming_hook()` instance methods to decouple iterator-level wrapping from per-chunk hook execution; avoids `get_response_string` materialization per chunk when no guardrail or chunk-hook callback is active
- Introduce `_fast_serialize_simple_model_response_stream()` using `orjson` for common single-choice text streaming chunks, bypassing the full Pydantic serializer; falls back to `model_dump_json` for tool calls, logprobs, usage, and provider-specific fields
- Add early-return in `_restamp_streaming_chunk_model` when downstream model already matches the requested model, avoiding unnecessary string comparisons on every chunk
- Fix stale zero-cost cache bug in `_is_model_cost_zero`: move the per-router `_zero_cost_cache` dict onto the `Router` instance and clear it in `_invalidate_model_group_info_cache` so in-place pricing updates via `upsert_deployment` immediately resume budget enforcement
- Add `scripts/benchmark_chat_completions_perf.py`: standalone async benchmarking tool with a mock OpenAI provider, LiteLLM proxy process management, non-streaming RPS, streaming TTFT, and full-stream latency measurements with repeat/median run support
- Add comprehensive unit tests covering capability detection, cache invalidation, fast-path correctness, zero-cost cache regression, and the no-callback streaming fast path

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>

* ci(mutmut): enable mutate_only_covered_lines to fit in CI budget (#27910)

The mutation-test workflow timed out at the 350-minute job cap when
running whole-folder mutation against litellm/proxy/management_endpoints/
(~30 files, ~1.5 MB of source). Every mutant was running the full
test suite, and mutants were generated for lines no test covers — which
would survive regardless, just wasting compute.

mutmut 3.x's mutate_only_covered_lines setting runs the suite once up
front to compute coverage, then skips mutating uncovered lines. This
cuts the mutant count dramatically and is the right semantic for the
score (no test → no kill possible → uncountable). Per-mutant test
filtering by function name is already automatic in mutmut 3.x; no
external coverage step is needed.

* fix(rate-limit): stop v3 limiter from leaking internal stash to provider body (#27913)

* fix(rate-limit): stop v3 limiter from leaking internal stash to provider body

PR #27001 (atomic TPM rate limit) introduced a reservation flow that
writes four LiteLLM-internal keys onto the request data dict:

  _litellm_rate_limit_descriptors
  _litellm_tpm_reserved_tokens
  _litellm_tpm_reserved_model
  _litellm_tpm_reserved_scopes
  _litellm_tpm_reservation_released

These keys are forwarded as request body params to the upstream provider,
which rejects them as unknown fields:

  OpenAI    -> 400 'Unknown parameter: _litellm_rate_limit_descriptors'
              (mapped by litellm to RateLimitError / 429, hiding the bug
               behind a misleading 'throttling_error' code)
  Anthropic -> 400 '_litellm_rate_limit_descriptors: Extra inputs are
               not permitted'

Net effect: every chat completion against any real provider fails the
moment a virtual key has any tpm_limit / rpm_limit set — i.e. v3-enforced
key-level TPM/RPM limits are broken end-to-end. The v3 RPM/TPM check
itself still runs (raises 429 on over-limit), but the success path
poisons the upstream body.

Reproduced on litellm_internal_staging HEAD (410ce761dc) against
gpt-4o-mini and claude-haiku-4-5 with a 1-RPM/1-TPM key — first request
fails with the provider's unknown-field error.

Fix: the stash is metadata only.

  - Add RATE_LIMIT_DESCRIPTORS_KEY constant and a _LITELLM_STASH_KEYS
    registry so we have a single source of truth for stash keys.
  - New helper _stash_value_in_metadata_channels writes to
    data['metadata'] / data['litellm_metadata'] without touching the
    top level.
  - _stash_reservation_in_data and the descriptor stash now route
    through that helper. _mark_reservation_released stops writing
    top-level.
  - _lookup_stashed_value also checks kwargs['metadata'] /
    kwargs['litellm_metadata'] (raw request_data shape) in addition to
    kwargs['litellm_params']['metadata'] (completion kwargs shape).
  - async_post_call_failure_hook now reads descriptors via the unified
    metadata lookup instead of request_data.get(top-level).
  - Defense in depth: async_pre_call_hook strips any stash key that
    somehow surfaced at the top level (stale cache, future refactor,
    test fixture) before returning.

Tests:
  - New regression test asserts no _litellm_* stash key is present at
    the top level of data after async_pre_call_hook, and that the
    metadata channel still carries the reservation + descriptors so
    success / failure reconciliation works.
  - Existing test_tpm_concurrent.py tests that asserted top-level
    presence are updated to read from data['metadata'] — the location
    is an implementation detail; the spec is that post-call callbacks
    can resolve the stash.

Verified end-to-end against OpenAI gpt-4o-mini and Anthropic
claude-haiku-4-5 via /v1/chat/completions on a low-rpm key:

  - With limits not exceeded: HTTP 200, valid completion response,
    no leaked fields in body.
  - With RPM exceeded: HTTP 429 from v3 enforcement
    ('Rate limit exceeded ... Limit type: requests').
  - With TPM exceeded: HTTP 429 from v3 enforcement
    ('Rate limit exceeded ... Limit type: tokens').

Full v3 hook test suite passes (171 tests).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* chore(rate-limit): use RATE_LIMIT_DESCRIPTORS_KEY constant in test, trim noisy comments

Address greptile P2: test fixture now uses the imported constant.
Drop comments that re-explain what well-named identifiers already convey.

* fix(rate-limit): reject caller-supplied stash values to prevent TPM-refund abuse

Strip _LITELLM_STASH_KEYS from data top-level and both metadata channels at
the start of async_pre_call_hook. Without this, an authenticated caller can
inject _litellm_rate_limit_descriptors plus _litellm_tpm_reserved_tokens in
body metadata, trigger a proxy-side rejection, and cause
async_post_call_failure_hook to refund TPM counters against attacker-named
scopes (e.g. another tenant's api_key).

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix: allow for allowlisted redirect URIs (#27761)

* fix: allow for allowlisted redirect URIs

* github comment addressing

* Update litellm/proxy/_experimental/mcp_server/oauth_utils.py

Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com>

* harden oauth wildcard further

* test: cover wildcard entry with dot-leading suffix rejection

---------

Co-authored-by: veria-ai[bot] <224490171+veria-ai[bot]@users.noreply.github.com>

* Emit native web_search_tool_result blocks for Anthropic clients (Claude Desktop / Cowork citations) (#27886)

* feat(custom_logger): add async_post_agentic_loop_response_hook

Lets a CustomLogger shape the response returned by the agentic-loop
follow-up call without bypassing the loop's safety / observability
machinery (depth tracking, fingerprinting, etc.). Default returns the
response unchanged.

Used by websearch_interception to inject Anthropic-native
web_search_tool_result blocks when the originating client requested a
native web_search_* tool.

* feat(llm_http_handler): call post-agentic-loop hook on the originating callback

In _execute_anthropic_agentic_plan, after anthropic_messages.acreate
returns, call the originating callback's
async_post_agentic_loop_response_hook so it can mutate the final
response (e.g. inject native tool_result blocks). Pass the callback
through from _call_agentic_completion_hooks.

Exceptions in the post-hook are caught and logged so a buggy callback
can't kill the request.

* feat(websearch_interception): add is_anthropic_native_web_search_tool

Identifies tools the Anthropic-native clients (Claude Desktop, the
Anthropic SDK, the Anthropic Console) use to request native search:
type starts with "web_search_" (e.g. web_search_20250305). Rejects the
LiteLLM standard tool, the OpenAI-function variant, the bare
"WebSearch" legacy name, and the bare "web_search" Claude Code shape.

This lets us decide per-request whether the client expects
web_search_tool_result content blocks in the response, without
renaming any existing constants or touching native-provider skip
logic.

* feat(websearch_interception): add build_web_search_tool_result_block

Produces the Anthropic-native web_search_tool_result content block
from a structured SearchResponse. Anthropic-native clients use this
block to populate citations / source links — the existing text-blob
flatten path only feeds readable evidence to the model and discards
the structure, so this builder gives us the missing piece.

Shape matches https://docs.anthropic.com/en/api/web-search-tool —
web_search_result items carry url, title, page_age, encrypted_content
(empty string when the search provider doesn't supply one).

* feat(websearch_interception): emit native web_search_tool_result blocks

When the originating client request carried a native Anthropic
web_search_* tool, the final response now also carries
web_search_tool_result content blocks alongside the model's text
answer — so Claude Desktop / Anthropic SDK clients can populate the
citations panel and replay conversation history with structured search
evidence.

Wiring:
- Pre-request hooks (both deployment + Anthropic path) set a flag on
  kwargs when they see a native web_search_* tool, so the signal
  survives the conversion-to-litellm_web_search step regardless of
  which hook fires first.
- _execute_search now returns (text, SearchResponse) so the structured
  results aren't lost when the text is flattened for the follow-up
  model call.
- _build_anthropic_request_patch returns the parallel list of
  SearchResponse objects.
- async_build_agentic_loop_plan pre-builds the web_search_tool_result
  blocks (one per tool_use_id) and stashes them on plan.metadata when
  the flag is set.
- async_post_agentic_loop_response_hook reads the metadata and
  prepends the blocks to response.content.
- _execute_agentic_loop mirrors the injection for the legacy path so
  both paths behave identically.

Clients that send the LiteLLM standard tool keep the existing
text-only behavior — no regression.

* test(websearch_interception): cover native web_search_tool_result emission

18 tests across:
- detector branches (native vs litellm-standard, OpenAI-function shape,
  Claude Desktop builtin WebSearch, bare web_search, missing type)
- block-builder shape (results, none, empty)
- pre-request hook flag-setting (native sets, standard does not)
- async_build_agentic_loop_plan attaches blocks to plan.metadata when
  the flag is present, leaves metadata untouched when absent
- post-hook injection into dict and object responses
- legacy _execute_agentic_loop mirrors the injection so both paths
  return the same shape

* test(websearch_short_circuit): keep _execute_search mocks in sync with new tuple return

* test(websearch_thinking_constraint): keep _execute_search mocks in sync with new tuple return

* feat(websearch_interception): emit native blocks from try_short_circuit_search

The agentic-loop post-hook only fires when the model returns a tool_use
block. Cowork / Claude Desktop on Bedrock actually make TWO requests
per user turn: the main /v1/messages with their builtin tool, and a
separate standalone /v1/messages whose only tool is
web_search_20250305. That second request hits try_short_circuit_search
— no agentic loop, no post-hook — and was returning text-only, leaving
the citations panel empty.

When the short-circuit input carries a native web_search_* tool, build
a synthetic server_tool_use + web_search_tool_result pair (using the
structured SearchResponse already returned by _execute_search) so the
client gets the native shape it expects. The legacy text block is
preserved so non-native short-circuit callers (Claude Code,
github_copilot, etc.) see the same payload as before.

Failure path still emits the native block pair (with empty results)
plus the text-error block, so the client gets a well-formed response
rather than a malformed half-shape.

* test(websearch_native_blocks): cover short-circuit native-block emission

Three new cases on top of the existing 18:
- native web_search_20250305 short-circuit → [server_tool_use,
  web_search_tool_result, text], ids paired, urls/titles carried.
- litellm_web_search short-circuit → text-only (no regression).
- native short-circuit on search failure → still emits the native
  block pair (empty results) plus the text-error block, so the client
  never sees a malformed half-shape.

* test(websearch_short_circuit): index assertions by block type, not by position

Native short-circuit responses now have [server_tool_use,
web_search_tool_result, text] when the input carries
web_search_20250305 — find the text block by type rather than relying
on content[0].

* fix(websearch_interception): gate legacy WebSearch name on schema absence

Clients like Cowork / Claude Desktop ship a client-side tool named
"WebSearch" with a full input_schema — they handle it themselves and
expect to make a separate native web_search_20250305 sub-request for
the actual search.

Today is_web_search_tool matches the bare name regardless of other
fields, which hijacks the client's tool server-side. The agentic loop
fires on the main request, the model never gets to emit the
client-side tool_use, and the separate native sub-request (where
citation data flows) is never made. Net: citations panel empty.

Real Anthropic client tools always carry input_schema (the API rejects
them otherwise), so a bare {name: "WebSearch"} with no schema is the
only thing that could be a legacy interception marker. Gate the match
on schema absence: legacy callers (if any) keep working, real
client-side WebSearch tools pass through untouched.

* fix(websearch_interception): drop "WebSearch" from response-detection lists

Post-conversion the model always sees ``litellm_web_search``, so the
"WebSearch" entry in the response-side tool_use detection lists was
dead at best. If a model ever did return ``tool_use(name="WebSearch")``
it would now (incorrectly) hijack the client's own ``WebSearch`` tool
again — same Cowork problem we just fixed on the input side. Drop it.

* test(websearch_native_blocks): cover the WebSearch legacy-name schema gate

Three new cases:
- {name: "WebSearch"} (bare interception marker) → still matched
- {name: "WebSearch", input_schema: {...}} (Cowork client tool) →
  passes through untouched
- {name: "WebSearch", description: "..."} (no schema) → still matched
  on the assumption it's a legacy marker rather than a malformed real
  client tool.

---------

Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com>

* ci(codecov): restore litellm/ prefix on uploaded coverage paths

pytest-cov runs with --cov=litellm, which makes coverage.xml store paths
relative to the package root (e.g. `proxy/proxy_server.py` instead of
`litellm/proxy/proxy_server.py`). Codecov auto-resolves these only when
the basename is unique in the repo. Files like proxy_server.py, router.py,
utils.py, main.py, and constants.py — which have duplicates under
enterprise/ or other subpackages — get silently dropped during ingest.

The `fixes: ["::litellm/"]` rule prepends `litellm/` to every uploaded
path so they resolve unambiguously. Confirmed against multiple recent
coverage.xml artifacts that no uploader currently emits paths already
prefixed with `litellm/`, so the rule is safe to apply universally.

This restores Codecov visibility for the highest-fix-rate hotspots:
proxy_server.py, router.py, proxy/utils.py, litellm_logging.py,
constants.py, key_management_endpoints.py, utils.py, main.py,
user_api_key_auth.py, team_endpoints.py, and litellm_pre_call_utils.py.

* chore(ci): remove unused GitHub Actions workflows and orphan files

Audit of .github/workflows/ via gh run history shows the following have
either never run or have been dormant for 10+ weeks. CI coverage that
still matters is preserved on CircleCI (e.g. llm_translation_testing).

Removed workflows:
- test-litellm.yml — workflow_dispatch only, last run 2026-02-12 (cancelled);
  CCI local_testing_part1/2 covers the same tests
- llm-translation-testing.yml — last run 2025-07-10; replaced by CCI
  llm_translation_testing job (run_llm_translation_tests.py kept for the
  make test-llm-translation target)
- run_observatory_tests.yml — last run 2026-03-03 (cancelled)
- scan_duplicate_issues.yml — last run 2026-03-02 (failure)
- publish_to_pypi.yml — never run
- read_pyproject_version.yml — fires on every push to main but its echoed
  version output is not consumed by any downstream step

Removed orphan files (no callers in workflows, CCI, or Makefile):
- .github/workflows/README.md — documented only publish_to_pypi.yml
- .github/workflows/update_release.py + results_stats.csv
- .github/actions/helm-oci-chart-releaser/

* Revert "ci(codecov): restore litellm/ prefix on uploaded coverage paths"

This reverts commit e25a988a3feb4a31843a67274a3a64fea2fed805.

The `fixes: ["::litellm/"]` rule turned out to be applied *after* Codecov's
auto-resolution, not before. Files with unique basenames (which were
auto-resolving correctly to `litellm/<path>`) got an extra `litellm/`
prepended, producing `litellm/litellm/<path>` storage. Files with
ambiguous basenames (the actual target of the fix) continued to be
dropped because the auto-resolution still failed for them.

Net result on the verification run: 1375 files now stored under
unresolvable `litellm/litellm/...` paths, and the 11 originally-missing
hotspots are still missing. Reverting before piling on further changes.

* test(ui): preserve global Button/Tooltip mocks in per-file @tremor/react vi.mock

Per-file `vi.mock("@tremor/react", ...)` factories fully replace the
setup-level mock from `tests/setupTests.ts`, so the global Button/Tooltip
overrides are lost in any file that re-mocks `@tremor/react`. Without
them, the real Tremor `<Button>` leaks through and its internal
`useTooltip(300)` schedules a native 300ms `setTimeout` on pointer
events. When the test environment is torn down before the timer fires,
the trailing `setState` calls `getCurrentEventPriority`, which reads
`window.event` against a destroyed jsdom -> "window is not defined"
flake observed on CI.

Patches the 7 leaky test files to re-supply `Button` (bare `<button>`)
and `Tooltip` (Fragment) overrides matching `setupTests.ts`. Also drops
a dead `afterEach` workaround in `user_edit_view.test.tsx` (the
fake-timer dance it ran could not drain a real timer scheduled before
the swap) and corrects a misleading comment in `MakeMCPPublicForm.test.tsx`.

* ci: use --cov=./litellm so coverage paths resolve unambiguously in Codecov

pytest-cov treats --cov=<module-name> as a Python package and emits XML
paths relative to the package root, stripping the litellm/ prefix
(`proxy/proxy_server.py` instead of `litellm/proxy/proxy_server.py`).
Codecov's auto-prefix heuristic then drops every file whose basename is
ambiguous in the repo — `proxy_server.py` (3 copies under enterprise/),
`router.py` (2 copies), `utils.py` (20+), `main.py` (20+), `constants.py`
(2). The 11 highest-fix-rate hotspots have never appeared in Codecov.

Switching to --cov=./litellm treats the argument as a path, which makes
coverage.xml emit repo-relative paths (`litellm/proxy/proxy_server.py`).
Each path is unambiguous, so Codecov resolves all files correctly.

Verified locally: rerunning a single proxy_unit_tests test with
--cov=./litellm produced `filename="litellm/proxy/proxy_server.py"`,
`filename="litellm/router.py"`, and `filename="litellm/types/router.py"`
as distinct entries — exactly the disambiguation Codecov needs.

Touches every workflow that uploads coverage: the two reusable GHA
workflows (_test-unit-base.yml, _test-unit-services-base.yml),
test-mcp.yml, and all 14 invocations in .circleci/config.yml.

* fix(mcp): allow delegate PKCE bypass for internal MCP servers

Remove available_on_public_internet gating from delegate-auth-to-upstream
paths so oauth2 + delegate_auth_to_upstream interactive servers behave
the same when marked internal. Keeps M2M exclusion. Updates tests.

* chore(mcp): warn on internal + upstream PKCE delegate

Log verbose_logger.warning when loading oauth2 interactive servers with
available_on_public_internet=false and delegate_auth_to_upstream=true
(config + DB). Dashboard Alert for the same combo. CLAUDE note for
operators. Tests for log and M2M skip.

* fix(mcp): dedupe load_servers_from_config alias block

Removes accidental duplicate alias/mcp_aliases and get_server_prefix
logic (fixes PLR0915 and avoids resetting alias after mapping).

* fix(mcp): expose delegate_auth_to_upstream in MCP server list rows (#27936)

_build_mcp_server_table omitted delegate_auth_to_upstream, so GET /v1/mcp/server always returned the default false while the registry kept the DB value.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(proxy): fix vector store retrieve/list/update/delete without model (#27929)

* feat(proxy): fix vector store retrieve/list/update/delete routing without model

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(proxy): remove unchecked query-param injection in vector store management endpoints

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(proxy): use subset assertion for vector store route test to allow extra kwargs like shared_session

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(managed_batches): convert raw output_file_id to managed ID in CheckBatchCost poller (#27984)

* fix(managed_batches): convert raw output_file_id to managed ID in CheckBatchCost poller

CheckBatchCost bypasses async_post_call_success_hook, causing raw provider
output_file_ids to be persisted in LiteLLM_ManagedObjectTable. This fix converts
output_file_id and error_file_id to managed base64 IDs before the DB write.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(check_batch_cost): persist managed file before mutating response and propagate team_id

- Move setattr after store_unified_file_id so the response only receives the
  managed ID once the DB record is successfully written. Avoids serializing
  an orphaned managed ID into file_object when the store call fails.
- Populate team_id on the minimal UserAPIKeyAuth from job.team_id so the
  managed file record is created with the correct team ownership, allowing
  other team members to access the batch output file via /files/{id}/content.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(managed_batches): extend test to cover error_file_id conversion

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix managed file test

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(vertex-ai): fix zero cost/usage on completed Vertex AI batch jobs (#27912)

* fix(vertex-ai): fix zero cost/usage on completed Vertex AI batch jobs

Vertex batch jobs recorded 0 spend and 0 tokens after PR #25627 added
automatic transformation of GCS predictions.jsonl to OpenAI format.

Two bugs fixed:

1. batch_utils.py: the Vertex-specific cost/usage reader
   (calculate_vertex_ai_batch_cost_and_usage) was always invoked and
   reads raw usageMetadata fields that no longer exist in the
   OpenAI-shaped output. Now the reader is only used when
   disable_vertex_batch_output_transformation=True; otherwise the
   generic path handles the already-transformed OpenAI-shaped content.

2. cost_calculator.py: batch_cost_calculator skipped the global
   litellm.get_model_info() lookup when a model_info dict was passed
   in, even when that dict had no pricing fields (e.g. deployment
   metadata with only id/db_model). It now falls back to the global
   pricing table when the provided model_info has no pricing data.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Update litellm/cost_calculator.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix(cost-calculator): use not-any guard for pricing fallback in batch_cost_calculator

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(cost-calculator): treat explicit zero batch pricing as set in model_info

The fallback to litellm.get_model_info() used truthy checks on pricing
fields, so 0.0 was treated as missing and replaced by global rates.
Use `is not None` like elsewhere in cost calculation. Add regression test.

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* feat: add weighted-routing failover (#27980)

* Feat: Add Weighted-Routing Failover

* test(router): cover weighted failover helper functions

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): align weighted failover deployment list type with mypy

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): address greptile review on weighted failover

- Narrow exception swallowing in `_maybe_run_weighted_failover` to
  `openai.APIError` so model failures defer to the regular fallback
  while programming bugs (AttributeError/KeyError/TypeError) surface.
- Note async-only limitation of `enable_weighted_failover` in the
  Router constructor docstring.
- Make the weighted distribution test less flaky (1000 iterations,
  looser bound) and make the non-simple-shuffle test deterministic by
  failing both deployments instead of relying on the latency strategy's
  first pick.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): ensure weighted failover metadata persists in kwargs

The previous `kwargs.setdefault(metadata_variable_name, {}) or {}` returned
a brand-new dict whenever the existing metadata was falsy (empty dict or
None), so writes to `_failover_excluded_ids` never made it back into
`kwargs`. Multi-hop weighted failover then re-selected previously failed
deployments and exhausted `max_fallbacks` prematurely.

Explicitly assign a fresh dict into kwargs when metadata is missing so
mutations are visible to subsequent failover hops.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(router): regression for weighted failover metadata persistence

Asserts kwargs["metadata"]["_failover_excluded_ids"] is populated after
_maybe_run_weighted_failover, proving the metadata dict written by the
helper is the same object that lives in kwargs (no disconnected copy).
Pairs with the prior fix that replaced `setdefault(..., {}) or {}` with
an explicit get/assign so writes survive across hops.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(router): harden weighted failover error/state handling

- Catch RouterRateLimitError (ValueError) alongside openai.APIError in
  _maybe_run_weighted_failover so an exhausted intra-group retry falls
  through to the regular cross-group fallback path instead of bubbling
  out and bypassing configured fallbacks.
- Stop mutating the shared input_kwargs dict; build a local copy with
  the weighted-failover keys so the entry (with _excluded_deployment_ids)
  cannot leak into later fallback paths reading the same dict.
- _get_excluded_filtered_deployments now returns an empty list when the
  exclusion filter removes every healthy deployment, instead of falling
  back to the original list. The original-list behavior risked re-picking
  the just-failed deployment; callers already handle the empty case by
  raising their no-deployments error, which weighted failover now catches
  and converts into a normal cross-group fallback.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(router): fall through to rpm/tpm when total weight is zero

When the weight metric's total is zero (e.g. after weighted-failover
exclusion leaves only zero-weight backups), continue to the next metric
(rpm/tpm) instead of returning a uniform random pick immediately. This
lets rpm/tpm still drive routing when present, and only falls back to
the uniform random pick at the end if no metric provides a positive
total weight.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(router): skip weighted failover when remaining deployments are all in cooldown

_maybe_run_weighted_failover was computing 'remaining' from all_deployments
(every deployment in the model group, including those in cooldown). This meant
that when all non-excluded deployments were in cooldown the method still invoked
run_async_fallback unnecessarily, which propagated into async_get_healthy_deployments,
found no eligible deployments, and raised RouterRateLimitError — only safely
caught thanks to the earlier exception-broadening fix.

The fix: before computing 'remaining', fetch the current cooldown set via
_async_get_cooldown_deployments and subtract it from all_ids. This allows
_maybe_run_weighted_failover to return None immediately (skipping the
run_async_fallback call entirely) when every non-failed deployment is in cooldown,
letting the caller fall through to the correct cross-group fallback path without
the wasteful extra round-trip.

Tests added:
- unit: _maybe_run_weighted_failover returns None without calling run_async_fallback
  when all remaining deployments are in cooldown
- unit: _maybe_run_weighted_failover still calls run_async_fallback when at least
  one healthy (non-cooldown) deployment is available
- integration: end-to-end fallthrough to cross-group fallback when remaining
  deployments are in cooldown

Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Sameer Kankute <Sameerlite@users.noreply.github.com>

* fix(bedrock-mantle): use /anthropic/v1/messages path for Mantle endpo… (#27976)

* fix(bedrock-mantle): use /anthropic/v1/messages path for Mantle endpoint (#27943)

* docs: add one-line docstring to _disable_debugging (#27894)

Squash-merged by litellm-agent from oss-agent-shin's PR.

* Add jp. Bedrock cross-region inference profile for claude-sonnet-4-6 (#27831)

Squash-merged by litellm-agent from Cyberfilo's PR.

* Sanitize empty text content blocks on /v1/messages (#27832)

Squash-merged by litellm-agent from Cyberfilo's PR.

* fix(bedrock-mantle): use /anthropic/v1/messages path for Mantle endpoint

The bedrock-mantle gateway (Claude Mythos Preview) serves the Anthropic
Messages API at /anthropic/v1/messages; /v1/messages returns 404 Not
Found. Both AmazonMantleConfig (chat/completions caller route) and
AmazonMantleMessagesConfig (anthropic-messages caller route) hardcoded
the wrong path, so every Mantle request 404'd before reaching the model.

Per the Anthropic docs: "[Claude in Amazon Bedrock] uses the Messages
API at /anthropic/v1/messages with SSE streaming."
https://platform.claude.com/docs/en/api/claude-on-amazon-bedrock

Confirmed independently against the live endpoint:
  /v1/chat/completions      -> 200 OK
  /v1/messages              -> 404 Not Found  (what litellm used)
  /anthropic/v1/messages    -> 200 OK         (Claude only)

Adds a regression test asserting both Mantle configs build the
/anthropic/v1/messages path, and updates the existing assertions that
encoded the wrong path.

---------

Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>

* fix: sanitize empty text blocks in sync anthropic_messages_handler path

Co-authored-by: Yassin Kortam <yassin@berri.ai>

---------

Co-authored-by: João Costa <13508071+jpv-costa@users.noreply.github.com>
Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(utils): import get_secret at runtime (#28014)

* fix(proxy): make /config/update env-var encryption idempotent

A single decrypt-then-encrypt chokepoint (_encrypt_env_variables_for_db)
now backs both update_config and save_config. Re-submitting a value the
Admin UI read back from /get/config/callbacks as ciphertext no longer
stacks a second encryption layer, which previously decrypted to garbage
and silently broke the callback. The chokepoint decrypts with the pure
_decrypt_db_variables (no os.environ mutation on the write path) and
encrypts exactly once; update_config merges only the sent keys so
untouched env vars keep their stored ciphertext byte-for-byte.

* test(proxy): add endpoint-level regression for /config/update double-encryption

Adds test_update_config_env_var_round_trip_not_double_encrypted, which
drives the real /config/update handler: first write plaintext, then
re-POST the stored ciphertext (the Admin UI round-trip) and assert the
value is not stacked with a second encryption layer and untouched keys
stay byte-identical. Verified to fail against the pre-fix handler and
pass after. Also tightens the unit test to exactly three ciphertext
re-feeds.

* chore(ci): modernize model references in tests and configs (#27856)

* test: modernize models used in CircleCI e2e test suites

Replaces obsolete models (gpt-4o, gpt-4o-mini, gpt-3.5-turbo,
claude-3-5-sonnet-20240620, claude-sonnet-4-20250514) with current
equivalents across the e2e_openai_endpoints and
proxy_e2e_anthropic_messages_tests CircleCI jobs.

- gpt-4o -> gpt-5.5 (responses API e2e tests)
- gpt-4o-mini -> gpt-5-mini (websocket responses, oai_misc_config)
- gpt-4o-mini-2024-07-18 -> gpt-4.1-mini-2025-04-14 (fine-tuning,
  still actively fine-tunable)
- gpt-4 / gpt-3.5-turbo target_model_names example -> gpt-5.5 /
  gpt-5-mini
- bedrock claude-3-5-sonnet-20240620 batch entry -> haiku-4-5-20251001
  (also aligning oai_misc_config model_name with what
  test_bedrock_batches_api.py actually requests)
- bedrock claude-sonnet-4-20250514 (deprecated, retires 2026-06-15)
  -> claude-sonnet-4-5-20250929

* test: point bedrock-claude-sonnet-4 alias at Sonnet 4.6, not 4.5

Greptile/Cursor flagged that after the previous commit, the
bedrock-claude-sonnet-4 alias collided with bedrock-claude-sonnet-4.5
(both pointed to claude-sonnet-4-5-20250929). Rename to
bedrock-claude-sonnet-4.6 and point it at the Sonnet 4.6 Bedrock ID
(us.anthropic.claude-sonnet-4-6, already in the litellm model
registry) so the alias name matches the underlying model version.

* test: modernize models across remaining CI-mounted configs & tests

Expands the modernization sweep to all CircleCI-mounted proxy configs
and to test directories where the model literal is a fixture/route key
(not the test's subject).

Config changes:
- pro…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants