Skip to content

[litellm-agent] Staging → litellm_internal_staging (5/21/2026)#28432

Closed
oss-pr-review-agent-shin[bot] wants to merge 9 commits into
litellm_internal_stagingfrom
shin_agent_oss_staging_05_21_2026
Closed

[litellm-agent] Staging → litellm_internal_staging (5/21/2026)#28432
oss-pr-review-agent-shin[bot] wants to merge 9 commits into
litellm_internal_stagingfrom
shin_agent_oss_staging_05_21_2026

Conversation

@oss-pr-review-agent-shin

Copy link
Copy Markdown
Contributor

Automated staging PR created by litellm-agent.

This branch collects PRs approved by the agent on 5/21/2026.

⚠️ Human review required before CI. Convert from draft to ready when you've reviewed the diff.

TorvaldUtne and others added 9 commits May 19, 2026 02:38
…#27700)

Squash-merged by litellm-agent from TorvaldUtne's PR.
Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
* feat(model_prices): add gemini-3.1-flash-lite pricing with standard/batch/flex/priority tiers

* fix pricing

* add service tier

---------

Co-authored-by: shin-berri <shin-laptop@berri.ai>
…dge (#28201)

* fix(anthropic): accept dict-shape reasoning_effort from Responses bridge

Issue #28196 — the Responses->Chat parser (transformation.py:184-200) keeps the full dict as reasoning_effort when summary is set; that branch was added in #25359. But the Anthropic transformation here still guarded on isinstance(value, str), silently dropping the param. Result: callers using the standard Reasoning(effort, summary) OpenAI-shaped object on Anthropic lose thinking entirely (0 reasoning_tokens, no thinking_blocks).

Coerce dict -> string before mapping. Same shape tolerance that gpt_5_transformation._normalize_reasoning_effort_for_chat_completion already implements. summary is irrelevant for Anthropic's thinking_blocks.

Adds two regression tests: one parametrized over string + dict shapes (with and without summary), one covering unparseable dict inputs (drops silently, no crash).

* test(anthropic): add non-adaptive model coverage for dict-shape reasoning_effort

Per Greptile feedback on PR #28198: the original regression test only exercised the adaptive (4.6+) path. Add a parametrized test for the non-adaptive branch (claude-sonnet-4-5) verifying that dict-shape reasoning_effort still maps to thinking.type='enabled' + budget_tokens, and that output_config is NOT set on pre-4.6 models.

* test(anthropic): convert unparseable-dict test to @pytest.mark.parametrize

Per @greptile-apps inline review on PR #28201 — matches the parametrize style of the two adjacent dict-shape tests and produces clearer failure messages (test ID per case instead of one collapsing for-loop).
…cks (#28215)

Squash-merged by litellm-agent from cwang-otto's PR.
…ng fallback (#28318)

Squash-merged by litellm-agent from cwang-otto's PR.
…PI requests (#28431)

Squash-merged by litellm-agent from cwang-otto's PR.
@oss-pr-review-agent-shin

Copy link
Copy Markdown
Contributor Author

@greptile please review

@greptile-apps

greptile-apps Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR bundles four independent improvements collected on 5/21/2026: mid-stream fallback support for the Responses API streaming path in Router, a fix for dict-shaped reasoning_effort being silently dropped in the Anthropic transformation, cache_control stripping for the OpenAI Responses API (mirroring the existing chat-completions path), and GA pricing entries for gemini-3.1-flash-lite across all three provider prefixes.

  • router.py: _aresponses_with_streaming_fallbacks wraps the returned iterator with _aresponses_streaming_iterator, which catches MidStreamFallbackError, injects a continuation prompt if partial content exists, merges partial-stream token usage onto the fallback's response.completed event, and shields cleanup with anyio.CancelScope. Full parity with _acompletion_streaming_iterator.
  • transformation.py (Anthropic): reasoning_effort branch now extracts the effort string from a dict shape (produced by the Responses→Chat bridge when summary is set on the reasoning field) before mapping, preventing silent disabling of extended thinking.
  • transformation.py (OpenAI responses): Adds remove_cache_control_flag_from_input_and_tools to strip Anthropic-only cache_control markers from Responses API input and tools, eliminating HTTP 400s from OpenAI for unknown parameters.

Confidence Score: 4/5

Safe to merge; the two observations in router.py are edge-case concerns that do not affect the primary streaming fallback path.

The FallbackResponsesStreamWrapper bypasses super().__init__() and manually mirrors every attribute, creating a maintenance coupling risk if the base class evolves. Additionally, when async_function_with_fallbacks_common_utils returns a non-iterable fallback in a stream=True context, the generator yields the raw ResponsesAPIResponse object as if it were a stream event rather than yielding None as the chat-completions fallback does — this could silently produce incorrect event-stream output in that specific scenario. Both concerns are unlikely to fire in practice given the existing test coverage.

litellm/router.py — specifically FallbackResponsesStreamWrapper.__init__ attribute mirroring and the non-streaming fallback yield branch inside stream_with_fallbacks.

Important Files Changed

Filename Overview
litellm/llms/anthropic/chat/transformation.py Fixes reasoning_effort handling to accept both string and dict shapes; the isinstance(value, str) guard was silently dropping dict-shaped values from the Responses→Chat bridge.
litellm/llms/openai/responses/transformation.py Adds remove_cache_control_flag_from_input_and_tools to strip Anthropic-only cache_control markers before sending to OpenAI's Responses API; mirrors the existing chat-completions path.
litellm/router.py Adds full mid-stream fallback support for the Responses API streaming path: _aresponses_streaming_iterator, _aresponses_with_streaming_fallbacks, and helpers for partial-usage extraction and continuation-input construction; large addition with minor edge-case gap.
tests/router_unit_tests/test_router_aresponses_streaming_fallback.py New mock-only unit-test file covering all four new Router helpers; no real network calls.
tests/test_litellm/test_router.py Adds mock tests for the aresponses streaming fallback path; reformats one existing assertion without changing its logic.
model_prices_and_context_window.json Adds GA pricing entries for gemini-3.1-flash-lite across vertex_ai, gemini/, and openrouter/google/ prefixes, completing the missing sibling entries for the stable variant.
tests/test_litellm/llms/anthropic/chat/test_anthropic_chat_transformation.py New parametrized tests covering string/dict-shaped reasoning_effort for both adaptive and non-adaptive Anthropic models, plus bad-value drop tests.
tests/test_litellm/llms/openai/responses/test_openai_responses_transformation.py New tests for cache_control stripping in input content blocks and tools; covers both the mutation path and the no-op path.
ui/litellm-dashboard/src/components/mcp_tools/ToolTestPanel.tsx Adds .trim() normalization to string inputs before type-conversion; change is applied consistently to all branches.
litellm/proxy/_lazy_openapi_snapshot.json Updates the example curl request body for POST /v1/agents in the OpenAPI snapshot; documentation change only.
tests/test_litellm/test_cost_calculator.py Adds a regression test asserting the openrouter/google/gemini-3.1-flash-lite pricing entry exists with correct costs.

Reviews (1): Last reviewed commit: "fix(openai-responses): strip Anthropic c..." | Re-trigger Greptile

Comment thread litellm/router.py
Comment on lines +2554 to +2555
else:
yield fallback_response

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Non-streaming fallback yields full response object as stream event

When async_function_with_fallbacks_common_utils returns a non-iterable (i.e., a completed ResponsesAPIResponse rather than a streaming iterator), the code yields the entire response object directly as a stream event. Any downstream consumer that expects events with a .type field (e.g., response.created, response.completed) will receive a ResponsesAPIResponse instead and likely produce an AttributeError or silently corrupt the event stream. The equivalent chat-completions fallback path yields None in this case, which is at minimum consistently neutral. Emitting a full response object here is more likely to cause subtle failures in production.

Comment thread litellm/router.py
Comment on lines +2435 to +2480
class FallbackResponsesStreamWrapper(BaseResponsesAPIStreamingIterator):
"""
Subclasses BaseResponsesAPIStreamingIterator only for isinstance
compatibility (proxy + interactions code paths check the type).
Bypasses the parent constructor and delegates iteration to an
async generator.
"""

def __init__(self, async_generator: AsyncGenerator):
import time

self._async_generator = async_generator
# Mirror every attribute BaseResponsesAPIStreamingIterator.__init__
# would have set. The wrapper bypasses super().__init__ (it has no
# httpx.Response of its own and no provider config to drive), so
# we copy from source_iterator where applicable and use safe
# defaults elsewhere. This keeps inherited methods (e.g.
# _check_max_streaming_duration, _handle_failure) safe to call.
self.response = source_iterator.response
self.model = source_iterator.model
self.logging_obj = source_iterator.logging_obj
self.finished = False
self.responses_api_provider_config = (
source_iterator.responses_api_provider_config
)
self.completed_response = None
self.start_time = source_iterator.start_time
self._failure_handled = False
self._completed_response_cached = False
self._completed_response_logged = False
self._completed_response_cache_hit = None
self._persist_completed_response_before_logging = True
self._stream_created_time = time.time()
self.litellm_metadata = source_iterator.litellm_metadata
self.custom_llm_provider = source_iterator.custom_llm_provider
self.request_data = source_iterator.request_data
self.call_type = source_iterator.call_type
# Preserve hidden params so response headers (model_id,
# api_base, additional_headers) keep flowing.
self._hidden_params = dict(source_iterator._hidden_params or {})

def __aiter__(self):
return self

async def __anext__(self):
return await self._async_generator.__anext__()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 FallbackResponsesStreamWrapper manually mirrors all parent attributes without calling super().__init__()

The constructor copies every attribute BaseResponsesAPIStreamingIterator.__init__ would set. Any future attribute added to the base class __init__ (e.g., a new flag or sub-object) will be silently absent from the wrapper, potentially causing AttributeError in inherited helper methods like _check_max_streaming_duration or _handle_failure that rely on those attributes. The docstring acknowledges the bypass but there is no enforced coupling. Consider at least adding a comment enumerating which version of BaseResponsesAPIStreamingIterator.__init__ this mirrors so reviewers know when to update it.

@Sameerlite

Copy link
Copy Markdown
Collaborator

https://github.com/BerriAI/litellm/pull/28542/commits

THis PR has the same commits, closing this

@Sameerlite Sameerlite closed this May 22, 2026
@Sameerlite Sameerlite deleted the shin_agent_oss_staging_05_21_2026 branch May 22, 2026 12:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants