
[litellm-agent] Staging → litellm_internal_staging (5/22/2026)#28542

Closed
oss-pr-review-agent-shin[bot] wants to merge 10 commits into litellm_internal_staging from shin_agent_oss_staging_05_22_2026


@oss-pr-review-agent-shin

Contributor

Automated staging PR created by litellm-agent.

This branch collects PRs approved by the agent on 5/22/2026.

⚠️ Human review required before CI. Convert from draft to ready when you've reviewed the diff.

TorvaldUtne and others added 10 commits May 19, 2026 02:38
…#27700)

Squash-merged by litellm-agent from TorvaldUtne's PR.
Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
* feat(model_prices): add gemini-3.1-flash-lite pricing with standard/batch/flex/priority tiers

* fix pricing

* add service tier

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>
…dge (#28201)

* fix(anthropic): accept dict-shape reasoning_effort from Responses bridge

Issue #28196 — the Responses->Chat parser (transformation.py:184-200) keeps the full dict as reasoning_effort when summary is set; that branch was added in #25359. But the Anthropic transformation here still guarded on isinstance(value, str), silently dropping the param. Result: callers using the standard Reasoning(effort, summary) OpenAI-shaped object on Anthropic lose thinking entirely (0 reasoning_tokens, no thinking_blocks).

Coerce dict -> string before mapping. Same shape tolerance that gpt_5_transformation._normalize_reasoning_effort_for_chat_completion already implements. summary is irrelevant for Anthropic's thinking_blocks.

Adds two regression tests: one parametrized over string + dict shapes (with and without summary), one covering unparseable dict inputs (drops silently, no crash).

* test(anthropic): add non-adaptive model coverage for dict-shape reasoning_effort

Per Greptile feedback on PR #28198: the original regression test only exercised the adaptive (4.6+) path. Add a parametrized test for the non-adaptive branch (claude-sonnet-4-5) verifying that dict-shape reasoning_effort still maps to thinking.type='enabled' + budget_tokens, and that output_config is NOT set on pre-4.6 models.

* test(anthropic): convert unparseable-dict test to @pytest.mark.parametrize

Per @greptile-apps inline review on PR #28201 — matches the parametrize style of the two adjacent dict-shape tests and produces clearer failure messages (test ID per case instead of one collapsing for-loop).
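The dict-to-string coercion described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `coerce_reasoning_effort` is hypothetical, and it assumes the only field that matters for Anthropic is `effort`, with `summary` ignored and unparseable inputs dropped silently.

```python
from typing import Any, Optional


def coerce_reasoning_effort(value: Any) -> Optional[str]:
    """Hypothetical helper: reduce a reasoning_effort value to its effort string.

    Accepts either a plain string ("low") or the dict shape produced by the
    Responses->Chat bridge ({"effort": "low", "summary": "concise"}).
    Returns None for anything it cannot parse, so the caller can drop the
    parameter without raising.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        effort = value.get("effort")
        if isinstance(effort, str):
            return effort
    return None


print(coerce_reasoning_effort({"effort": "high", "summary": "concise"}))
```

The caller would then map the returned string to a thinking configuration exactly as it already does for string inputs.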
…cks (#28215)

Squash-merged by litellm-agent from cwang-otto's PR.
…ng fallback (#28318)

Squash-merged by litellm-agent from cwang-otto's PR.
…PI requests (#28431)

Squash-merged by litellm-agent from cwang-otto's PR.
)

Squash-merged by litellm-agent from adityasingh2400's PR.
@oss-pr-review-agent-shin

Contributor Author

@greptile please review

@CLAassistant

CLAassistant commented May 22, 2026


CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
5 out of 7 committers have signed the CLA.

✅ TorvaldUtne
✅ IshaMeera
✅ ro31337
✅ cwang-otto
✅ mubashir1osmani
❌ oss-agent-shin
❌ adityasingh2400

@greptile-apps

greptile-apps Bot commented May 22, 2026

Contributor

Greptile Summary

This automated staging PR bundles five independent fixes: mid-stream fallback support for the Responses API streaming path in Router, Anthropic reasoning_effort dict-shape handling, cache_control stripping for OpenAI's Responses API, a custom pricing regression fix in utils.py, and new gemini-3.1-flash-lite (GA) pricing entries.

  • router.py: Adds _aresponses_with_streaming_fallbacks and _aresponses_streaming_iterator to give the Responses API path the same mid-stream fallback chain that chat completions already had. An inner FallbackResponsesStreamWrapper subclasses BaseResponsesAPIStreamingIterator without calling super().__init__() and manually copies all expected attributes from the source iterator.
  • llms/anthropic/chat/transformation.py: Widens the reasoning_effort parameter to accept {"effort": "low", "summary": "concise"} dict shapes produced by the Responses→Chat bridge, coercing to the string form before mapping.
  • llms/openai/responses/transformation.py: Strips Anthropic-only cache_control markers from input content blocks and tools before sending to OpenAI, mirroring the existing chat-completions strip logic — note the strip mutates the caller's input in-place.
  • utils.py: Two-part fix so register_model no longer persists litellm_provider: None into model_cost, and _check_provider_match treats None as a wildcard, restoring custom pricing for provider-less deployments.
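The two-part utils.py fix summarized in the last bullet can be illustrated with a small sketch. Both function names below are hypothetical stand-ins, not the signatures of `register_model` or `_check_provider_match`, and the sketch assumes the wildcard applies when the stored entry's `litellm_provider` is None.

```python
from typing import Any, Dict, Optional


def drop_none_provider(model_info: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical: return a copy without a litellm_provider key set to None."""
    return {
        key: value
        for key, value in model_info.items()
        if not (key == "litellm_provider" and value is None)
    }


def provider_matches(entry_provider: Optional[str], requested: str) -> bool:
    """Hypothetical: a missing (None) provider on the entry matches any provider."""
    if entry_provider is None:
        return True
    return entry_provider == requested


entry = drop_none_provider({"litellm_provider": None, "input_cost_per_token": 1e-6})
print(entry, provider_matches(entry.get("litellm_provider"), "openai"))
```

Either half alone would leave a gap: without the first, None values keep being persisted; without the second, entries that already carry None would still fail to match.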

Confidence Score: 4/5

Safe to merge after a second look at the two areas noted; all other changes are well-contained with good test coverage.

The Responses API streaming fallback machinery in router.py is a substantial new code path. FallbackResponsesStreamWrapper skips the base-class constructor and clones ~15 attributes by hand — any future attribute addition to BaseResponsesAPIStreamingIterator will silently be absent from the wrapper. The in-place mutation of the caller's input list in remove_cache_control_flag_from_input_and_tools is a latent hazard for provider-fallback retry flows where the same input object is reused. Both issues are non-blocking today but worth tracking. The rest of the changes (reasoning_effort widening, pricing fix, model prices, UI trim) are narrow and thoroughly tested.

Areas needing a second look: litellm/router.py (manual attribute init in FallbackResponsesStreamWrapper) and litellm/llms/openai/responses/transformation.py (in-place mutation of input in remove_cache_control_flag_from_input_and_tools).

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/llms/anthropic/chat/transformation.py | Relaxes the reasoning_effort guard from isinstance(value, str) to also accept dict shape {"effort": "...", "summary": "..."} produced by the Responses→Chat bridge, coercing to the effort string before mapping. Fix is narrow and well-tested. |
| litellm/llms/openai/responses/transformation.py | Adds remove_cache_control_flag_from_input_and_tools to strip Anthropic-only cache_control markers before sending to OpenAI's Responses API. Correct fix but filter_value_from_dict mutates the caller's input in-place, which is a latent issue for retry scenarios. |
| litellm/router.py | Adds mid-stream fallback handling for the Responses API path (_aresponses_streaming_iterator, _aresponses_with_streaming_fallbacks, and three static helpers). Large addition with comprehensive tests; FallbackResponsesStreamWrapper bypasses super().__init__() creating a manual attribute surface that could drift from the base class. |
| litellm/utils.py | Two-part fix for custom pricing regression (#28336): strips litellm_provider: None from existing_model in register_model, and treats None as a wildcard match in _check_provider_match. Both changes are safe and well-tested. |
| model_prices_and_context_window.json | Adds stable (GA) gemini-3.1-flash-lite entries for bare, gemini/, vertex_ai/, and openrouter/google/ prefixes. Pricing is consistent with the existing preview variant. |
| ui/litellm-dashboard/src/components/mcp_tools/ToolTestPanel.tsx | Trims leading/trailing whitespace from string inputs before type-conversion and submission. Small, targeted change with no behavioural side-effects for non-string types. |
| tests/router_unit_tests/test_router_aresponses_streaming_fallback.py | New unit test file covering all four Responses-API streaming fallback helpers. Tests are mock-only with no real network calls. Good coverage of partial-usage combining, continuation input, and passthrough paths. |
| tests/test_litellm/test_router.py | Adds six new aresponses streaming fallback tests plus reformats an existing assertion block. All tests use mocks; no real network calls. Good regression coverage for metadata key routing, pre/post-chunk paths, and usage combining. |

Reviews (1): Last reviewed commit: "Treat None litellm_provider as wildcard ..."

Comment on lines 149 to 185

        return final_request_params

    def remove_cache_control_flag_from_input_and_tools(
        self,
        model: str,  # allows overrides to selectively run this
        input: Union[str, ResponseInputParam],
        tools: Optional[List[ALL_RESPONSES_API_TOOL_PARAMS]] = None,
    ) -> Tuple[
        Union[str, ResponseInputParam],
        Optional[List[ALL_RESPONSES_API_TOOL_PARAMS]],
    ]:
        """Sibling of `remove_cache_control_flag_from_messages_and_tools` on
        the chat path. Strips Anthropic-only `cache_control` markers from
        Responses API input content blocks and tools.

        `filter_value_from_dict` mutates each dict in place, so the same
        objects are returned.
        """
        from litellm.litellm_core_utils.prompt_templates.common_utils import (
            filter_value_from_dict,
        )

        if isinstance(input, list):
            for item in input:
                if isinstance(item, dict):
                    filter_value_from_dict(cast(dict, item), "cache_control")

        if tools is not None:
            for tool in tools:
                if isinstance(tool, dict):
                    filter_value_from_dict(cast(dict, tool), "cache_control")

        return input, tools

    def _validate_input_param(
        self, input: Union[str, ResponseInputParam]

Contributor


P2 In-place mutation of caller's input

filter_value_from_dict mutates each dict in-place, so the original input list/items passed by the caller are permanently modified before the request is even sent. If a higher-level caller holds a reference to the same input data and retries with a different provider (e.g. first tries OpenAI, falls back to Anthropic), the Anthropic provider will receive input that has already had its cache_control fields stripped. The docstring acknowledges the mutation but does not warn callers about this reuse hazard.
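The reuse hazard described in this comment, and one possible mitigation, can be shown with a self-contained sketch. The function `strip_key_in_place` is a hypothetical stand-in for an in-place filter (it is not `filter_value_from_dict`, whose exact behavior is not shown here), and `strip_key_copy` is an assumed alternative that deep-copies before stripping so the caller's data survives for a later fallback attempt.

```python
import copy
from typing import Any, List


def strip_key_in_place(obj: Any, key: str) -> None:
    """Hypothetical in-place filter: recursively delete `key` from dicts."""
    if isinstance(obj, dict):
        obj.pop(key, None)
        for value in obj.values():
            strip_key_in_place(value, key)
    elif isinstance(obj, list):
        for item in obj:
            strip_key_in_place(item, key)


def strip_key_copy(items: List[Any], key: str) -> List[Any]:
    """Assumed mitigation: strip from a deep copy, leaving the caller's list intact."""
    cloned = copy.deepcopy(items)
    strip_key_in_place(cloned, key)
    return cloned


original = [
    {
        "role": "user",
        "content": [
            {"type": "input_text", "text": "hi", "cache_control": {"type": "ephemeral"}}
        ],
    }
]
cleaned = strip_key_copy(original, "cache_control")
print("cache_control" in cleaned[0]["content"][0])   # stripped in the copy
print("cache_control" in original[0]["content"][0])  # still present for a fallback provider
```

The deep copy costs extra memory on large inputs, so whether it is worth it depends on how often the same input object is reused across providers.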

Comment thread litellm/router.py
Comment on lines +2386 to +2435
                            "must read as a seamless continuation."
                        ),
                    }
                ],
            },
            {
                "type": "message",
                "role": "assistant",
                "content": [{"type": "output_text", "text": generated_content}],
            },
        ]
        return cast("ResponseInputParam", base + continuation)

    async def _aresponses_streaming_iterator(
        self,
        response: "BaseResponsesAPIStreamingIterator",
        initial_kwargs: Dict[str, Any],
    ) -> "BaseResponsesAPIStreamingIterator":
        """
        Wrap a Responses-API streaming iterator so MidStreamFallbackError
        triggers the Router's fallback chain (parity with
        _acompletion_streaming_iterator for the chat-completions path).

        The Responses-API streaming path goes through
        _ageneric_api_call_with_fallbacks rather than _acompletion, so the
        returned iterator is never wrapped by the chat completions
        fallback handler. Without this wrapper, MidStreamFallbackError
        raised mid-stream from the underlying CustomStreamWrapper (used by
        LiteLLMCompletionStreamingIterator when the Responses API is
        served via the completion bridge) propagates unhandled and the
        configured cross-provider fallback never fires.

        Full parity with the chat-completions path:
          - Pre-first-chunk: retry with the original input unchanged.
          - Partial content: inject a developer instruction + prior
            assistant message carrying the generated text so the fallback
            model continues rather than restarts.
          - Usage combining: merge partial-stream usage onto the fallback's
            response.completed event so accounting reflects both attempts.
          - Stream cleanup: shielded aclose() on both source and fallback
            iterators on terminate.
        """
        from litellm.exceptions import MidStreamFallbackError
        from litellm.responses.streaming_iterator import (
            BaseResponsesAPIStreamingIterator,
        )

        source_iterator = response

        class FallbackResponsesStreamWrapper(BaseResponsesAPIStreamingIterator):
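The "pre-first-chunk" and "partial content" cases from the docstring above can be sketched as a small pure function. This is an illustration under assumptions, not the helper from the PR: the name `build_continuation_input`, the developer instruction wording, and the use of `input_text` content blocks for the user and developer messages are all assumed; only the assistant `output_text` message shape appears in the excerpt.

```python
from typing import Any, Dict, List, Union

InputParam = Union[str, List[Dict[str, Any]]]


def build_continuation_input(original_input: InputParam, generated_content: str) -> InputParam:
    """Hypothetical sketch of building the input for a fallback attempt."""
    # Pre-first-chunk: nothing was generated, so retry with the input unchanged.
    if not generated_content:
        return original_input

    # Normalize a plain-string input into a single user message (assumed shape).
    if isinstance(original_input, str):
        base: List[Dict[str, Any]] = [
            {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": original_input}],
            }
        ]
    else:
        base = list(original_input)  # shallow copy so the caller's list is not extended

    continuation = [
        {
            "type": "message",
            "role": "developer",
            # Assumed wording; the real instruction text is only partly visible above.
            "content": [{"type": "input_text", "text": "Continue the partial answer below."}],
        },
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": generated_content}],
        },
    ]
    return base + continuation


print(len(build_continuation_input("Explain streaming.", "Streaming is")))
```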

Contributor


P2 Manual attribute init in FallbackResponsesStreamWrapper may drift from base class

FallbackResponsesStreamWrapper intentionally bypasses super().__init__() and manually initialises every attribute BaseResponsesAPIStreamingIterator.__init__ would set (e.g. _failure_handled, _completed_response_cached, _persist_completed_response_before_logging, etc.). Any new attribute added to the base class in the future will silently be absent from the wrapper, which can lead to AttributeError in methods like _check_max_streaming_duration or _handle_failure that the proxy layer may call. A comment or assertion guarding the set of required attributes would help prevent future drift.
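One way to implement the guard suggested here is to compare the wrapper's instance attributes against the source iterator's after the manual copy. The classes below are simplified, hypothetical stand-ins (they are not `BaseResponsesAPIStreamingIterator` or the real wrapper), and the check assumes every attribute the base constructor sets is visible in the instance `__dict__`.

```python
from typing import Any, Set


def missing_attributes(source: Any, wrapper: Any) -> Set[str]:
    """Return instance attribute names present on source but absent on wrapper."""
    return set(vars(source)) - set(vars(wrapper))


class BaseIterator:
    """Hypothetical stand-in for the streaming iterator base class."""

    def __init__(self, model: str) -> None:
        self.model = model
        self._failure_handled = False
        self._completed_response_cached = None


class FallbackWrapper(BaseIterator):
    """Hypothetical wrapper that skips super().__init__() and copies state by hand."""

    def __init__(self, source: BaseIterator) -> None:
        self.model = source.model
        self._failure_handled = source._failure_handled
        self._completed_response_cached = source._completed_response_cached
        # Guard: fail loudly at construction time if the base class grew an attribute.
        missing = missing_attributes(source, self)
        if missing:
            raise AttributeError(f"wrapper is missing attributes: {sorted(missing)}")


class DriftedBase(BaseIterator):
    """Simulates a future base class that sets one more attribute."""

    def __init__(self, model: str) -> None:
        super().__init__(model)
        self._new_flag = True


wrapper = FallbackWrapper(BaseIterator("model-a"))
print(wrapper.model)
```

With this guard, wrapping a `DriftedBase` instance raises at construction rather than failing later inside an unrelated method call.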

@Sameerlite

Collaborator

Closing this in favour of #28582

@Sameerlite Sameerlite closed this May 22, 2026
@Sameerlite Sameerlite deleted the shin_agent_oss_staging_05_22_2026 branch May 22, 2026 12:08

9 participants