test(ci): add reasoning_effort grid e2e regression suite #28036
Encode the 231-cell QA sweep (21 provider x model combos x 11 effort values) from #27039 / #27074 as an automated CircleCI-gated regression suite. Each cell hits the real provider endpoint, captures the outgoing wire body via a pre-call CustomLogger, and asserts:
- thinking.type, output_config.effort, thinking.budget_tokens, max_tokens in the captured request body (regression signal for silent drops/strips in any provider transformation)
- HTTP status (200 vs BadRequestError -> 400) returned by litellm (regression signal for clean-error vs leaked-500 mappings)

The matrix is encoded as a small rule set keyed by (model_mode, effort) plus per-model xhigh/max capability overrides, then expanded across the five chat-completion routes (Anthropic direct, Azure AI Foundry, Vertex AI, Bedrock Converse, Bedrock Invoke /chat) and the Bedrock Invoke /v1/messages route.

Cells skip at runtime when the route's provider env vars are absent, so PR builds without credentials no-op gracefully.

Wired into CircleCI as the reasoning_effort_grid_v4_e2e job behind the existing main / litellm_* branch filter.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
… body, guard budget tokens
- Remove unused vertex_credentials_path fixture (and now-unused os import) from conftest.py.
- Parse Bedrock Converse complete_input_dict (logged as a JSON string by converse_handler.py) before passing to _assert_cell, so dict accessors work uniformly across routes.
- Extend _BUDGET_TOKENS with xhigh and max entries so the budget-mode branch in expected() cannot KeyError if a future budget model gains the matching cap.

Co-authored-by: Yassin Kortam <yassin@berri.ai>
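A minimal sketch of the Converse parsing step described above, assuming the captured body can arrive either as a dict or as a JSON string. `normalize_wire_body` and `thinking_fields` are hypothetical names used only for illustration, and the sample payload is made up; the `additionalModelRequestFields` and `inferenceConfig.maxTokens` keys are the ones the suite's own helpers read.

```python
import json
from typing import Any, Dict, Optional, Union


def normalize_wire_body(
    body: Union[str, Dict[str, Any], None]
) -> Optional[Dict[str, Any]]:
    """Return the captured body as a dict, parsing it first if it was logged as JSON text."""
    if isinstance(body, str):
        return json.loads(body)
    return body


def thinking_fields(body: Dict[str, Any]) -> Dict[str, Any]:
    """Converse nests thinking/output_config under additionalModelRequestFields."""
    return body.get("additionalModelRequestFields", body)


# Hypothetical payload, shaped like a Converse request that was logged as a string.
raw = json.dumps(
    {
        "inferenceConfig": {"maxTokens": 8192},
        "additionalModelRequestFields": {
            "thinking": {"type": "enabled", "budget_tokens": 2048}
        },
    }
)
parsed = normalize_wire_body(raw)
sub = thinking_fields(parsed)
```

Normalizing once, before any assertion runs, keeps the per-route differences out of the assertion helper itself.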
Cursor Bugbot has reviewed your changes using high mode and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Sonnet-4-6 missing `max` cap causes wrong expected status
  - Renamed `_CAPS_OPUS_4_6` to `_CAPS_4_6` (since the `supports_max_reasoning_effort` cap is shared by opus and sonnet 4.6) and assigned it to all sonnet-4-6 ModelEntry definitions across every route, so `expected()` now returns status=200 for effort='max', matching the runtime.
Preview (51dff1ff79)
diff --git a/.circleci/config.yml b/.circleci/config.yml
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -578,6 +578,48 @@
# Store test results
- store_test_results:
path: test-results
+ reasoning_effort_grid_v4_e2e:
+ docker:
+ - *python312_image
+ working_directory: ~/project
+ resource_class: large
+
+ steps:
+ - checkout
+ - setup_google_dns
+ - install_uv
+ - restore_cache:
+ keys:
+ - v1-uv-cache-{{ checksum "uv.lock" }}
+ - run:
+ name: Install Dependencies
+ command: |
+ uv sync --frozen --all-groups --all-extras --python 3.12
+ - save_cache:
+ paths:
+ - ~/.cache/uv
+ key: v1-uv-cache-{{ checksum "uv.lock" }}
+ # Grid v4 exercises reasoning_effort mapping against real Anthropic,
+ # Azure AI Foundry, Vertex AI, Bedrock Converse, and Bedrock Invoke
+ # endpoints. Per-route cells pytest-skip themselves when the matching
+ # provider env vars are absent, so PRs without credentials no-op.
+ - run:
+ name: Run reasoning_effort grid v4 e2e suite
+ command: |
+ mkdir -p test-results
+ uv run --no-sync python -m pytest \
+ tests/test_litellm/reasoning_effort_grid_v4/ \
+ -v \
+ --junitxml=test-results/junit.xml \
+ --durations=20 \
+ -n 4 \
+ --timeout=180 --timeout_method=thread \
+ --retries 2 --retry-delay 5 \
+ --max-worker-restart=5
+ no_output_timeout: 20m
+
+ - store_test_results:
+ path: test-results
realtime_translation_testing:
docker:
- *python312_image
@@ -2619,6 +2661,8 @@
filters: *main_branches
- llm_translation_testing:
filters: *main_branches
+ - reasoning_effort_grid_v4_e2e:
+ filters: *main_branches
- realtime_translation_testing:
filters: *main_branches
- agent_testing:
diff --git a/tests/test_litellm/reasoning_effort_grid_v4/__init__.py b/tests/test_litellm/reasoning_effort_grid_v4/__init__.py
new file mode 100644
diff --git a/tests/test_litellm/reasoning_effort_grid_v4/conftest.py b/tests/test_litellm/reasoning_effort_grid_v4/conftest.py
new file mode 100644
--- /dev/null
+++ b/tests/test_litellm/reasoning_effort_grid_v4/conftest.py
@@ -1,0 +1,51 @@
+"""Shared fixtures for the reasoning_effort grid v4 e2e suite."""
+
+from typing import Any, Dict, List, Optional
+
+import pytest
+
+import litellm
+from litellm.integrations.custom_logger import CustomLogger
+
+
+class _WireBodyCapture(CustomLogger):
+ """Pre-call hook that records the outgoing wire body LiteLLM sends upstream.
+
+ `complete_input_dict` is the fully transformed provider request as set by
+ every provider transformation in `litellm/llms/**`. Capturing it here means
+ a regression anywhere in the transformation chain (strip, rename, drop)
+ surfaces as an assertion failure on the cell that depends on it.
+ """
+
+ def __init__(self) -> None:
+ super().__init__()
+ self.records: List[Dict[str, Any]] = []
+
+ def log_pre_api_call(self, model, messages, kwargs):
+ self.records.append(
+ {
+ "model": model,
+ "body": kwargs.get("additional_args", {}).get("complete_input_dict"),
+ "api_base": kwargs.get("additional_args", {}).get("api_base"),
+ }
+ )
+
+ async def async_log_pre_api_call(self, model, messages, kwargs):
+ self.log_pre_api_call(model, messages, kwargs)
+
+ def latest(self) -> Optional[Dict[str, Any]]:
+ return self.records[-1] if self.records else None
+
+ def reset(self) -> None:
+ self.records.clear()
+
+
+@pytest.fixture()
+def wire_capture():
+ capture = _WireBodyCapture()
+ previous = list(litellm.callbacks)
+ litellm.callbacks = previous + [capture]
+ try:
+ yield capture
+ finally:
+ litellm.callbacks = previous
diff --git a/tests/test_litellm/reasoning_effort_grid_v4/grid_spec.py b/tests/test_litellm/reasoning_effort_grid_v4/grid_spec.py
new file mode 100644
--- /dev/null
+++ b/tests/test_litellm/reasoning_effort_grid_v4/grid_spec.py
@@ -1,0 +1,303 @@
+"""
+Canonical post-fix expectations for the reasoning_effort grid v4 sweep.
+
+The QA sweep on https://github.com/BerriAI/litellm/pull/27039#issuecomment-4363363610
+covered 21 (provider x model) combos x 11 effort values (231 cells). The follow-up
+PR https://github.com/BerriAI/litellm/pull/27074 closed nine bugs surfaced by that
+sweep. This module encodes the post-fix expectations as a small rule set keyed by
+(model_mode, effort) and per-model capability overrides, then expands them across
+the model x effort matrix per route.
+"""
+
+from dataclasses import dataclass, field
+from typing import Dict, FrozenSet, List, Optional, Tuple
+
+
+OMIT = object()
+
+
+@dataclass(frozen=True)
+class CellExpectation:
+ """Expected post-fix behavior for a single grid cell."""
+
+ status: int
+ thinking_type: object
+ output_config_effort: object = OMIT
+ thinking_budget_tokens: object = OMIT
+ max_tokens: object = OMIT
+
+
+@dataclass(frozen=True)
+class ModelEntry:
+ alias: str
+ model: str
+ mode: str
+ extra_params: Tuple[Tuple[str, str], ...] = field(default_factory=tuple)
+ required_env: FrozenSet[str] = field(default_factory=frozenset)
+ caps: FrozenSet[str] = field(default_factory=frozenset)
+
+ def params(self) -> Dict[str, str]:
+ return dict(self.extra_params)
+
+
+EFFORTS: Tuple[str, ...] = (
+ "__omit__",
+ "none",
+ "minimal",
+ "low",
+ "medium",
+ "high",
+ "xhigh",
+ "max",
+ "disabled",
+ "invalid",
+ "",
+)
+
+_BUDGET_TOKENS: Dict[str, int] = {
+ "minimal": 1024,
+ "low": 1024,
+ "medium": 2048,
+ "high": 4096,
+ "xhigh": 8192,
+ "max": 16384,
+}
+
+_ADAPTIVE_EFFORT_LABEL: Dict[str, str] = {
+ "minimal": "low",
+ "low": "low",
+ "medium": "medium",
+ "high": "high",
+ "xhigh": "xhigh",
+ "max": "max",
+}
+
+_BAD_REQUEST_EFFORTS: FrozenSet[str] = frozenset({"disabled", "invalid", ""})
+
+
+def expected(model: ModelEntry, effort: str) -> CellExpectation:
+ """Compute the post-fix expected cell for a (model, effort) pair."""
+ if effort in ("__omit__", "none"):
+ if model.mode == "budget":
+ return CellExpectation(status=200, thinking_type=OMIT, max_tokens=8192)
+ return CellExpectation(status=200, thinking_type=OMIT)
+
+ if effort in _BAD_REQUEST_EFFORTS:
+ return CellExpectation(status=400, thinking_type=OMIT)
+
+ if effort in ("xhigh", "max"):
+ cap = f"supports_{effort}_reasoning_effort"
+ if cap not in model.caps:
+ return CellExpectation(status=400, thinking_type=OMIT)
+
+ if model.mode == "adaptive":
+ return CellExpectation(
+ status=200,
+ thinking_type="adaptive",
+ output_config_effort=_ADAPTIVE_EFFORT_LABEL[effort],
+ )
+
+ return CellExpectation(
+ status=200,
+ thinking_type="enabled",
+ thinking_budget_tokens=_BUDGET_TOKENS[effort],
+ max_tokens=8192,
+ )
+
+
+_ANTHROPIC_REQ = frozenset({"ANTHROPIC_API_KEY"})
+_AZURE_FOUNDRY_REQ = frozenset({"AZURE_FOUNDRY_API_BASE", "AZURE_FOUNDRY_API_KEY"})
+_VERTEX_REQ = frozenset({"VERTEX_PROJECT"})
+_BEDROCK_REQ = frozenset({"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"})
+
+
+_CAPS_OPUS_4_7: FrozenSet[str] = frozenset(
+ {"supports_xhigh_reasoning_effort", "supports_max_reasoning_effort"}
+)
+_CAPS_4_6: FrozenSet[str] = frozenset({"supports_max_reasoning_effort"})
+_CAPS_NONE: FrozenSet[str] = frozenset()
+
+
+ANTHROPIC_DIRECT_MODELS: Tuple[ModelEntry, ...] = (
+ ModelEntry(
+ alias="claude-opus-4-7",
+ model="anthropic/claude-opus-4-7",
+ mode="adaptive",
+ required_env=_ANTHROPIC_REQ,
+ caps=_CAPS_OPUS_4_7,
+ ),
+ ModelEntry(
+ alias="claude-sonnet-4-6",
+ model="anthropic/claude-sonnet-4-6",
+ mode="adaptive",
+ required_env=_ANTHROPIC_REQ,
+ caps=_CAPS_4_6,
+ ),
+ ModelEntry(
+ alias="claude-haiku-4-5",
+ model="anthropic/claude-haiku-4-5",
+ mode="budget",
+ required_env=_ANTHROPIC_REQ,
+ caps=_CAPS_NONE,
+ ),
+)
+
+
+AZURE_AI_MODELS: Tuple[ModelEntry, ...] = (
+ ModelEntry(
+ alias="azure-claude-opus-4-7",
+ model="azure_ai/claude-opus-4-7",
+ mode="adaptive",
+ required_env=_AZURE_FOUNDRY_REQ,
+ caps=_CAPS_OPUS_4_7,
+ ),
+ ModelEntry(
+ alias="azure-claude-opus-4-6",
+ model="azure_ai/claude-opus-4-6",
+ mode="adaptive",
+ required_env=_AZURE_FOUNDRY_REQ,
+ caps=_CAPS_4_6,
+ ),
+ ModelEntry(
+ alias="azure-claude-sonnet-4-6",
+ model="azure_ai/claude-sonnet-4-6",
+ mode="adaptive",
+ required_env=_AZURE_FOUNDRY_REQ,
+ caps=_CAPS_4_6,
+ ),
+ ModelEntry(
+ alias="azure-claude-haiku-4-5",
+ model="azure_ai/claude-haiku-4-5",
+ mode="budget",
+ required_env=_AZURE_FOUNDRY_REQ,
+ caps=_CAPS_NONE,
+ ),
+)
+
+
+VERTEX_AI_MODELS: Tuple[ModelEntry, ...] = (
+ ModelEntry(
+ alias="vertex-claude-opus-4-7",
+ model="vertex_ai/claude-opus-4-7",
+ mode="adaptive",
+ extra_params=(("vertex_location", "global"),),
+ required_env=_VERTEX_REQ,
+ caps=_CAPS_OPUS_4_7,
+ ),
+ ModelEntry(
+ alias="vertex-claude-opus-4-6",
+ model="vertex_ai/claude-opus-4-6",
+ mode="adaptive",
+ extra_params=(("vertex_location", "us-east5"),),
+ required_env=_VERTEX_REQ,
+ caps=_CAPS_4_6,
+ ),
+ ModelEntry(
+ alias="vertex-claude-sonnet-4-6",
+ model="vertex_ai/claude-sonnet-4-6",
+ mode="adaptive",
+ extra_params=(("vertex_location", "us-east5"),),
+ required_env=_VERTEX_REQ,
+ caps=_CAPS_4_6,
+ ),
+ ModelEntry(
+ alias="vertex-claude-haiku-4-5",
+ model="vertex_ai/claude-haiku-4-5",
+ mode="budget",
+ extra_params=(("vertex_location", "us-east5"),),
+ required_env=_VERTEX_REQ,
+ caps=_CAPS_NONE,
+ ),
+)
+
+
+BEDROCK_CONVERSE_MODELS: Tuple[ModelEntry, ...] = (
+ ModelEntry(
+ alias="bedrock-claude-opus-4-7",
+ model="bedrock/converse/us.anthropic.claude-opus-4-7",
+ mode="adaptive",
+ extra_params=(("aws_region_name", "us-east-1"),),
+ required_env=_BEDROCK_REQ,
+ caps=_CAPS_OPUS_4_7,
+ ),
+ ModelEntry(
+ alias="bedrock-claude-opus-4-6",
+ model="bedrock/converse/us.anthropic.claude-opus-4-6-v1",
+ mode="adaptive",
+ extra_params=(("aws_region_name", "us-east-1"),),
+ required_env=_BEDROCK_REQ,
+ caps=_CAPS_4_6,
+ ),
+ ModelEntry(
+ alias="bedrock-claude-sonnet-4-6",
+ model="bedrock/converse/us.anthropic.claude-sonnet-4-6",
+ mode="adaptive",
+ extra_params=(("aws_region_name", "us-east-1"),),
+ required_env=_BEDROCK_REQ,
+ caps=_CAPS_4_6,
+ ),
+ ModelEntry(
+ alias="bedrock-claude-sonnet-4-5",
+ model="bedrock/converse/us.anthropic.claude-sonnet-4-5-20250929-v1:0",
+ mode="budget",
+ extra_params=(("aws_region_name", "us-east-1"),),
+ required_env=_BEDROCK_REQ,
+ caps=_CAPS_NONE,
+ ),
+)
+
+
+BEDROCK_INVOKE_CHAT_MODELS: Tuple[ModelEntry, ...] = (
+ ModelEntry(
+ alias="bedrock-invoke-claude-opus-4-6",
+ model="bedrock/invoke/us.anthropic.claude-opus-4-6-v1",
+ mode="adaptive",
+ extra_params=(("aws_region_name", "us-east-1"),),
+ required_env=_BEDROCK_REQ,
+ caps=_CAPS_4_6,
+ ),
+ ModelEntry(
+ alias="bedrock-invoke-claude-sonnet-4-6",
+ model="bedrock/invoke/us.anthropic.claude-sonnet-4-6",
+ mode="adaptive",
+ extra_params=(("aws_region_name", "us-east-1"),),
+ required_env=_BEDROCK_REQ,
+ caps=_CAPS_4_6,
+ ),
+ ModelEntry(
+ alias="bedrock-invoke-claude-opus-4-5",
+ model="bedrock/invoke/us.anthropic.claude-opus-4-5-20251101-v1:0",
+ mode="budget",
+ extra_params=(("aws_region_name", "us-east-1"),),
+ required_env=_BEDROCK_REQ,
+ caps=_CAPS_NONE,
+ ),
+)
+
+
+BEDROCK_INVOKE_MESSAGES_MODELS: Tuple[ModelEntry, ...] = BEDROCK_INVOKE_CHAT_MODELS
+
+
+@dataclass(frozen=True)
+class Route:
+ name: str
+ models: Tuple[ModelEntry, ...]
+
+
+ROUTES: Tuple[Route, ...] = (
+ Route("anthropic_direct", ANTHROPIC_DIRECT_MODELS),
+ Route("azure_ai", AZURE_AI_MODELS),
+ Route("vertex_ai", VERTEX_AI_MODELS),
+ Route("bedrock_converse", BEDROCK_CONVERSE_MODELS),
+ Route("bedrock_invoke_chat", BEDROCK_INVOKE_CHAT_MODELS),
+ Route("bedrock_invoke_messages", BEDROCK_INVOKE_MESSAGES_MODELS),
+)
+
+
+def all_cells() -> List[Tuple[str, ModelEntry, str, CellExpectation]]:
+ cells: List[Tuple[str, ModelEntry, str, CellExpectation]] = []
+ for route in ROUTES:
+ for model in route.models:
+ for effort in EFFORTS:
+ cells.append((route.name, model, effort, expected(model, effort)))
+ return cells
diff --git a/tests/test_litellm/reasoning_effort_grid_v4/test_grid_v4.py b/tests/test_litellm/reasoning_effort_grid_v4/test_grid_v4.py
new file mode 100644
--- /dev/null
+++ b/tests/test_litellm/reasoning_effort_grid_v4/test_grid_v4.py
@@ -1,0 +1,236 @@
+"""
+End-to-end grid v4 regression suite for reasoning_effort mapping across
+Anthropic-backed routes.
+
+Encodes the 21 (provider x model) x 11 effort matrix (231 cells) from the
+QA sweep on https://github.com/BerriAI/litellm/pull/27039#issuecomment-4363363610
+that the fix in https://github.com/BerriAI/litellm/pull/27074 was validated
+against. Each cell asserts:
+
+ - Wire body shape captured pre-call (thinking.type, output_config.effort,
+ thinking.budget_tokens, max_tokens) -- the regression signal for silent
+ drops/strips anywhere in the transformation chain.
+ - Status code returned by LiteLLM (200 vs BadRequestError -> 400) -- the
+ regression signal for clean-error vs leaked-500 mappings.
+
+Hits real provider endpoints. Each route is skipped at runtime when its
+required env vars are absent, so PR builds without provider credentials no-op
+gracefully.
+"""
+
+import json
+import os
+from typing import Any, Dict, List, Optional, Tuple
+
+import pytest
+
+import litellm
+from litellm.exceptions import BadRequestError
+
+from .grid_spec import (
+ OMIT,
+ ROUTES,
+ CellExpectation,
+ ModelEntry,
+ all_cells,
+)
+
+
+_PROMPT_MESSAGES: List[Dict[str, str]] = [
+ {"role": "user", "content": "Step by step, calculate 47 * 53. Show your work."}
+]
+
+
+def _required_env_missing(model: ModelEntry) -> Optional[str]:
+ missing = [key for key in model.required_env if not os.environ.get(key)]
+ if missing:
+ return "missing env: " + ", ".join(sorted(missing))
+ return None
+
+
+def _max_tokens_for(model: ModelEntry) -> int:
+ return 200 if model.mode == "adaptive" else 8192
+
+
+def _build_completion_kwargs(model: ModelEntry, effort: str) -> Dict[str, Any]:
+ kwargs: Dict[str, Any] = {
+ "model": model.model,
+ "messages": _PROMPT_MESSAGES,
+ "max_tokens": _max_tokens_for(model),
+ }
+ kwargs.update(model.params())
+ if effort != "__omit__":
+ kwargs["reasoning_effort"] = effort
+ if model.model.startswith("vertex_ai/"):
+ kwargs["vertex_project"] = os.environ.get(
+ "VERTEX_PROJECT", "vertex-check-481318"
+ )
+ if model.model.startswith("azure_ai/"):
+ kwargs["api_base"] = os.environ["AZURE_FOUNDRY_API_BASE"]
+ kwargs["api_key"] = os.environ["AZURE_FOUNDRY_API_KEY"]
+ return kwargs
+
+
+def _build_messages_kwargs(model: ModelEntry, effort: str) -> Dict[str, Any]:
+ kwargs = _build_completion_kwargs(model, effort)
+ return kwargs
+
+
+def _converse_subbody(body: Dict[str, Any]) -> Dict[str, Any]:
+ """Return the dict that holds thinking/output_config for a Converse wire body."""
+ return body.get("additionalModelRequestFields", body)
+
+
+def _max_tokens_from_body(body: Dict[str, Any], route_name: str) -> Optional[int]:
+ if route_name == "bedrock_converse":
+ return body.get("inferenceConfig", {}).get("maxTokens")
+ return body.get("max_tokens")
+
+
+def _assert_cell(
+ route_name: str,
+ body: Optional[Dict[str, Any]],
+ status: int,
+ cell: CellExpectation,
+) -> None:
+ assert status == cell.status, f"expected status={cell.status}, got status={status}"
+
+ if cell.status != 200:
+ # Bad-request paths short-circuit before the wire body matters.
+ return
+
+ assert body is not None, "wire body was not captured for a 200-status cell"
+ subbody = _converse_subbody(body) if route_name == "bedrock_converse" else body
+ thinking = subbody.get("thinking")
+ output_config = subbody.get("output_config")
+
+ if cell.thinking_type is OMIT:
+ assert thinking is None, f"expected thinking omitted, got {thinking!r}"
+ else:
+ assert thinking is not None, "expected thinking present, got omit"
+ assert thinking.get("type") == cell.thinking_type, (
+ f"expected thinking.type={cell.thinking_type!r}, "
+ f"got {thinking.get('type')!r}"
+ )
+
+ if cell.output_config_effort is OMIT:
+ assert (
+ output_config is None or "effort" not in output_config
+ ), f"expected output_config.effort omitted, got {output_config!r}"
+ else:
+ assert output_config is not None, (
+ f"expected output_config.effort={cell.output_config_effort!r}, "
+ "got output_config omitted"
+ )
+ assert output_config.get("effort") == cell.output_config_effort, (
+ f"expected output_config.effort={cell.output_config_effort!r}, "
+ f"got {output_config.get('effort')!r}"
+ )
+
+ if cell.thinking_budget_tokens is not OMIT:
+ assert thinking is not None
+ assert thinking.get("budget_tokens") == cell.thinking_budget_tokens, (
+ f"expected thinking.budget_tokens={cell.thinking_budget_tokens!r}, "
+ f"got {thinking.get('budget_tokens')!r}"
+ )
+
+ if cell.max_tokens is not OMIT:
+ wire_max = _max_tokens_from_body(body, route_name)
+ assert (
+ wire_max == cell.max_tokens
+ ), f"expected max_tokens={cell.max_tokens!r}, got {wire_max!r}"
+
+
+_PARAMS: List[Tuple[str, ModelEntry, str, CellExpectation]] = all_cells()
+
+
+def _cell_id(case: Tuple[str, ModelEntry, str, CellExpectation]) -> str:
+ route_name, model, effort, _ = case
+ effort_label = "__empty__" if effort == "" else effort
+ return f"{route_name}-{model.alias}-{effort_label}"
+
+
+_PARAM_IDS: List[str] = [_cell_id(case) for case in _PARAMS]
+
+
+async def _call_chat(model: ModelEntry, effort: str) -> Tuple[int, Optional[Exception]]:
+ kwargs = _build_completion_kwargs(model, effort)
+ try:
+ await litellm.acompletion(**kwargs)
+ return 200, None
+ except BadRequestError as exc:
+ return 400, exc
+ except Exception as exc:
+ return 500, exc
+
+
+async def _call_messages(
+ model: ModelEntry, effort: str
+) -> Tuple[int, Optional[Exception]]:
+ kwargs = _build_messages_kwargs(model, effort)
+ try:
+ await litellm.messages.acreate(**kwargs)
+ return 200, None
+ except BadRequestError as exc:
+ return 400, exc
+ except Exception as exc:
+ return 500, exc
+
+
+@pytest.mark.asyncio
+@pytest.mark.parametrize(
+ ("route_name", "model", "effort", "cell"), _PARAMS, ids=_PARAM_IDS
+)
+async def test_reasoning_effort_grid_v4(
+ route_name: str,
+ model: ModelEntry,
+ effort: str,
+ cell: CellExpectation,
+ wire_capture,
+) -> None:
+ skip_reason = _required_env_missing(model)
+ if skip_reason:
+ pytest.skip(skip_reason)
+
+ if route_name == "bedrock_invoke_messages":
+ status, exc = await _call_messages(model, effort)
+ else:
+ status, exc = await _call_chat(model, effort)
+
+ record = wire_capture.latest()
+ body = record["body"] if record else None
+ # Bedrock Converse logs `complete_input_dict` as a JSON string (see
+ # litellm/llms/bedrock/chat/converse_handler.py); parse it so the dict
+ # accessors in `_assert_cell` work uniformly across routes.
+ if route_name == "bedrock_converse" and isinstance(body, str):
+ body = json.loads(body)
+
+ try:
+ _assert_cell(route_name, body, status, cell)
+ except AssertionError:
+ if exc is not None:
+ raise AssertionError(
+ f"underlying exception ({type(exc).__name__}): {exc}"
+ ) from None
+ raise
+
+
+def test_grid_v4_cell_count() -> None:
+ """Guard against accidental drops or duplicates in the grid spec."""
+ assert len(_PARAMS) == 21 * 11, (
+ f"expected 231 cells (21 provider x model combos x 11 efforts), "
+ f"got {len(_PARAMS)}"
+ )
+
+
+def test_grid_v4_route_coverage() -> None:
+ """The grid must cover every route the original QA sweep covered."""
+ route_names = {route.name for route in ROUTES}
+ assert route_names == {
+ "anthropic_direct",
+ "azure_ai",
+ "vertex_ai",
+ "bedrock_converse",
+ "bedrock_invoke_chat",
+ "bedrock_invoke_messages",
+ }
Reviewed by Cursor Bugbot for commit 4327427.
…t cap
The runtime _validate_effort_for_model allows effort='max' for any Claude 4.6 model (opus or sonnet), and model_prices_and_context_window sets supports_max_reasoning_effort: true for claude-sonnet-4-6. The grid spec previously gave sonnet-4-6 entries _CAPS_NONE, so expected() returned status=400 for effort='max', which mismatched the runtime's status=200 and caused 6 cells (one per route) to fail.

Rename _CAPS_OPUS_4_6 to _CAPS_4_6 (since the cap set is shared by opus and sonnet 4.6) and assign it to all sonnet-4-6 entries.

Co-authored-by: Yassin Kortam <yassin@berri.ai>
…on, drop v4 naming
- Drop the "v4" suffix throughout: it referred to the QA sweep iteration,
not this test suite. There's only one regression suite, so just call it
reasoning_effort_grid.
- Move tests/test_litellm/reasoning_effort_grid_v4/ -> tests/llm_translation/
reasoning_effort_grid/. Two reasons:
1. The parent tests/test_litellm/conftest.py installs an autouse fixture
(isolate_host_aws_config) that clears every AWS_* env var before each
test, which would silently skip every Bedrock cell.
2. tests/llm_translation/conftest.py already wires up the Redis-backed
VCR persister and auto-applies @pytest.mark.vcr to every collected
item via apply_vcr_auto_marker_to_items. Living under that conftest
means the suite gets cassette replay for free -- first CI run with
provider creds records 231 cassettes, every subsequent run replays
them with no live spend.
- Trim the suite's own conftest down to just the wire_capture fixture; the
inherited llm_translation conftest covers the VCR plumbing.
- Drop the dedicated reasoning_effort_grid_v4_e2e CircleCI job. The existing
llm_translation_testing job globs tests/llm_translation/**/test_*.py, so
the suite is gated by an existing job with no new wiring.
…997/git/BerriAI/litellm into litellm_grid-v4-e2e-tests-cZRwz
0c4fed4 to d084782
… openapi field
Two CI failures, both pre-existing in different ways:
1. reasoning_effort_grid: all 33 bedrock_invoke_messages cells failed with
AttributeError("module 'litellm' has no attribute 'messages'"). litellm
exposes the async Anthropic Messages entrypoint as litellm.anthropic_messages
(via "from .llms.anthropic.experimental_pass_through.messages.handler
import *" in litellm/__init__.py), not litellm.messages.acreate. Swap
the call.
2. tests/test_litellm/interactions/test_openapi_compliance.py::TestResponseCompliance::test_interaction_response_fields
asserts the live Google spec contains "steps". Google's spec has churned
through "outputs" -> "steps" -> neither, and presently carries neither.
The test broke on main as soon as upstream dropped "steps"; pulling the
key off the assert list realigns the test with the live schema. Re-add
the per-turn output field once upstream stabilizes on a name.
The openapi-compliance fix doesn't belong to this PR conceptually but is
included here per request to unblock CI before the morning.
… not class
The anthropic_messages route wraps client-side BadRequestError as AnthropicError (a BaseLLMException subclass) with status_code=400, so "except BadRequestError" missed those cells and they fell through to the generic Exception arm, returning 500 instead of the expected 400.

Replace the isinstance-on-BadRequestError check with a tiny classifier that prefers BadRequestError membership, then falls back to the exception's status_code attribute (set by every BaseLLMException subclass), then 500. Apply to both _call_chat and _call_messages for consistency.

Fixes the 13 CircleCI llm_translation_testing failures on bedrock_invoke_messages cells where the effort was disabled / invalid / empty / xhigh-on-unsupported / max-on-unsupported.
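The classifier described in this commit is not shown on the page, so the following is only a sketch of the stated order (BadRequestError membership, then `status_code`, then 500). The three exception classes are stand-ins defined here so the example runs without LiteLLM installed, and `classify_status` is a hypothetical name.

```python
from typing import Optional


class BaseLLMException(Exception):
    """Stand-in for LiteLLM's base exception, which carries a status_code."""

    def __init__(self, status_code: int, message: str) -> None:
        super().__init__(message)
        self.status_code = status_code


class BadRequestError(Exception):
    """Stand-in for litellm.exceptions.BadRequestError."""


class AnthropicError(BaseLLMException):
    """Stand-in for the wrapper raised on the anthropic_messages route."""


def classify_status(exc: Optional[BaseException]) -> int:
    """Map a call outcome to the status code the grid compares against."""
    if exc is None:
        return 200  # the call succeeded
    if isinstance(exc, BadRequestError):
        return 400  # preferred signal: exception class membership
    status = getattr(exc, "status_code", None)
    if isinstance(status, int):
        return status  # fallback: status carried on the exception
    return 500  # anything else counts as a leaked server-style error
```

With this shape, an `AnthropicError(400, ...)` is reported as 400 even though it is not a `BadRequestError`, which is the case the commit message describes.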
Four pre-existing flakes on main that gate this branch's workflow even though they're unrelated to the reasoning_effort_grid suite:
1. tests/local_testing/test_completion.py::test_completion_fireworks_ai
2. tests/local_testing/test_completion_cost.py::test_completion_cost_fireworks_ai[fireworks_ai/llama-v3p3-70b-instruct]
3. tests/llm_translation/test_fireworks_ai_translation.py::test_document_inlining_example[False]

For 1-3, the Fireworks-hosted `llama-v3p3-70b-instruct` deployment is currently returning 404 "Model not found, inaccessible, and/or not deployed". These tests pass when the model is deployed; the issue is upstream capacity, not our code path. Wrap the live call in a try/except that calls pytest.skip on litellm.NotFoundError so a Fireworks deployment hiccup no longer fails CI for unrelated PRs.

4. tests/llm_translation/test_gemini.py::test_gemini_image_size_limit_exceeded

The test fetches the 32MB "Blue Marble 2002" image from Wikimedia to exercise the 50MB image-size cap. CI runners share an IP pool with noisy traffic, so Wikimedia routinely returns HTTP 429 and the size-limit check never gets a chance to fire. Catch the 429 BadRequestError and pytest.skip in that case.

None of these belong on this PR conceptually, but they're included per request to unblock the workflow before morning.
…ageFetchError
litellm.ImageFetchError is a subclass of BadRequestError, so when Wikimedia returns 429 the pytest.raises(ImageFetchError) block matches and swallows the exception -- the outer try/except never fires. Drop the try/except and check the captured error message for "Status code: 429" after the raises block, calling pytest.skip in that case. Same intent, right control flow.
Greptile Summary
Converts a 231-cell manual QA grid sweep for
Confidence Score: 5/5
Safe to merge: changes are entirely test-side with no production code impact.

All changes are in the test layer. The new regression suite is well-structured, skips cleanly when credentials are absent, and inherits existing VCR infrastructure. The only notable concern is that model capability caps are hardcoded in grid_spec.py rather than derived from the production config JSON, which can cause the test oracle to drift if capabilities are updated upstream, but this does not affect production behavior.

tests/llm_translation/reasoning_effort_grid/grid_spec.py: the hardcoded `_CAPS_*` constants should ideally be derived from get_model_info() to avoid oracle drift.

| Filename | Overview |
|---|---|
| tests/llm_translation/reasoning_effort_grid/conftest.py | Adds wire_capture fixture using a CustomLogger pre-call hook; correctly restores litellm.callbacks in a finally block. |
| tests/llm_translation/reasoning_effort_grid/grid_spec.py | Encodes expected cells for the 231-cell matrix; capability caps for xhigh/max are hardcoded rather than read from the production model-config JSON, risking oracle drift. |
| tests/llm_translation/reasoning_effort_grid/test_reasoning_effort_grid.py | Parametrized 231-cell async test suite with wire-body and status assertions; skip guards, VCR plumbing, and error classification are all handled correctly. |
| tests/test_litellm/interactions/test_openapi_compliance.py | Removes assertion on the steps field from Google's live schema; reduces test coverage for a field whose upstream name is currently unstable. |
| tests/llm_translation/test_fireworks_ai_translation.py | Adds pytest.skip on NotFoundError to handle upstream model unavailability; preserves existing assertion logic. |
| tests/llm_translation/test_gemini.py | Reformats long assertion lines and adds a pytest.skip guard for Wikimedia 429 rate-limits in the image-size test; substantive assertions unchanged. |
| tests/local_testing/test_completion.py | Adds pytest.skip for NotFoundError on Fireworks completion test; existing test logic unmodified. |
| tests/local_testing/test_completion_cost.py | Wraps Fireworks completion call in try/except to skip on NotFoundError; cost assertion logic preserved. |
Reviews (2): Last reviewed commit: "refactor(reasoning_effort_grid): tighten..."
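Greptile's oracle-drift concern could be addressed roughly as follows. This is a sketch only: `caps_from_model_info` is a hypothetical helper, and it takes a plain mapping so it runs standalone. Whether `get_model_info()` actually surfaces these two keys is an assumption that is not verified here.

```python
from typing import Any, FrozenSet, Mapping

# Capability flags the grid spec cares about (names taken from grid_spec.py).
_CAP_KEYS = (
    "supports_xhigh_reasoning_effort",
    "supports_max_reasoning_effort",
)


def caps_from_model_info(info: Mapping[str, Any]) -> FrozenSet[str]:
    """Collect the capability flags that are explicitly true in a model-info mapping."""
    return frozenset(key for key in _CAP_KEYS if info.get(key) is True)


# Hypothetical entries mimicking the shape of the model-config JSON.
sonnet_like_info = {"supports_max_reasoning_effort": True}
haiku_like_info = {"supports_max_reasoning_effort": False}

sonnet_caps = caps_from_model_info(sonnet_like_info)
haiku_caps = caps_from_model_info(haiku_like_info)
```

Deriving the caps this way would make the expected status follow the production config instead of a second hand-maintained copy. The trade-off is that the test oracle then shares a source of truth with the code under test.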
…view
Two P2 nits flagged by Greptile on PR 28036:
1. _build_completion_kwargs() defaulted vertex_project to "vertex-check-481318" when VERTEX_PROJECT was unset. That value is a specific GCP project that doesn't belong to this repo, so if the env-var skip guard were ever bypassed (misconfig, direct helper call), the test would silently issue calls to a foreign project rather than failing loudly. Drop the fallback and read os.environ["VERTEX_PROJECT"] directly, mirroring how AZURE_FOUNDRY_* are handled.
2. _build_messages_kwargs() was a one-liner that returned the result of _build_completion_kwargs() unchanged -- a dead abstraction with one caller. Inline at the _call_messages call site and delete the helper.
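The first nit comes down to the difference between a defaulted and a strict environment read. A minimal sketch, assuming a hypothetical `vertex_kwargs` helper that accepts the environment as a parameter so the behavior can be shown without touching the real process environment:

```python
import os
from typing import Dict, Mapping


def vertex_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the Vertex kwargs, raising KeyError when VERTEX_PROJECT is unset."""
    # Strict lookup: no fallback project, so a bypassed skip guard fails loudly.
    return {"vertex_project": env["VERTEX_PROJECT"]}


configured = vertex_kwargs({"VERTEX_PROJECT": "example-project"})

try:
    vertex_kwargs({})
    failed_loudly = False
except KeyError:
    failed_loudly = True
```

The project name `example-project` is a placeholder, not a real project.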
…s-cZRwz
Resolve conflicts in the five unrelated CI-flake fixes I previously landed on this branch -- staging shipped stronger versions (mocked HTTP for the Fireworks tests, mocked image-fetch for the Gemini size-limit test, switched the openapi-compliance test to the Interaction response schema instead of dropping the assertion). Take staging's version of all five files and drop my now-unreachable 429-skip lines from the Gemini test that the auto-merge left behind.

Summary
Converts the manual QA grid sweep from #27039 (231 cells: 21 provider × model combos × 11 effort values, used to validate the fix in #27074) into an automated regression suite. Each cell hits a real provider endpoint once, then VCR-replays from a Redis-backed cassette on every subsequent CI run, so there is no live spend after the first record pass.
- Wire body: `thinking.type`, `output_config.effort`, `thinking.budget_tokens`, `max_tokens`. Catches any silent drop/strip in a provider transformation (the bug class that #27074, "fix(anthropic,bedrock,vertex): forward output_config.effort + 400 on garbage reasoning_effort", fixed for Bedrock + Vertex + Bedrock-Invoke `/v1/messages`).
- Status: `BadRequestError` → 400. Catches regressions in the `ValueError` → `BadRequestError` mappings (`disabled` / `invalid` / `""` / unsupported `xhigh` / `max` on non-supporting models).
- Matrix: encoded in `tests/llm_translation/reasoning_effort_grid/grid_spec.py`, keyed by `(model_mode, effort)` plus per-model `supports_xhigh_reasoning_effort` / `supports_max_reasoning_effort` overrides, then expanded across the 5 chat-completion routes and the Bedrock Invoke `/v1/messages` route. Adding a new model or effort is a one-line change.
- VCR: inherited from `tests/llm_translation/conftest.py`. That conftest auto-applies `@pytest.mark.vcr` to every collected item and registers the Redis-backed cassette persister (`CASSETTE_REDIS_URL`). First CI run with provider creds records 231 cassettes; every subsequent run replays them. No new VCR plumbing in this PR.
- CI: the `llm_translation_testing` workflow already globs `tests/llm_translation/**/test_*.py`, so the suite is gated by an existing job behind the `*main_branches` filter (`main` + `litellm_*`). Per-cell `pytest.skip` when route env vars are absent, so PR builds without credentials no-op gracefully.

Routes covered (matches QA proxy config exactly)
Files
- `tests/llm_translation/reasoning_effort_grid/grid_spec.py`: post-fix expectation rule set + model matrix per route.
- `tests/llm_translation/reasoning_effort_grid/conftest.py`: `wire_capture` fixture (CustomLogger that records `complete_input_dict` pre-call). VCR plumbing inherited from the parent conftest.
- `tests/llm_translation/reasoning_effort_grid/test_reasoning_effort_grid.py`: single parametrized test (231 cells) + 2 meta-tests for grid integrity.

Test plan
- `pytest --collect-only` lists 231 parametrized cells + 2 meta-tests (233 total)
- `test_grid_cell_count` and `test_grid_route_coverage` pass without provider env vars
- `black .` formatting applied
- CI run on a `litellm_*` branch with provider creds (records all 231 cassettes; verify all cells pass against the post-fix code from #27074)

Note
Medium Risk
Adds a large new VCR-backed e2e test matrix (231 parametrized cells) that can affect CI runtime/stability and depends on recorded provider interactions and env-gated live calls, though it doesn’t change production code paths.
Overview
Introduces a new `reasoning_effort_grid` e2e regression suite that sweeps 21 Anthropic-backed model/provider routes × 11 `reasoning_effort` values and asserts both the returned status (200 vs 400) and the captured upstream wire body fields (`thinking.*`, `output_config.effort`, `max_tokens`) to catch transformation/mapping regressions.

Encodes the expected behavior as rules in `grid_spec.py` (expanded into the full matrix) and adds a `wire_capture` fixture that hooks LiteLLM callbacks to record `complete_input_dict` pre-call; tests rely on the existing `tests/llm_translation` Redis VCR auto-marker so calls are recorded once and replayed in CI, with per-route skips when required env vars are missing.

Hardens a few existing live-provider tests by skipping on upstream instability: Fireworks tests now `pytest.skip` on `NotFoundError` (404 model missing), the Gemini image-size-limit test skips when Wikimedia rate-limits (429), and an OpenAPI compliance test relaxes expectations to avoid a now-missing per-turn output field in Google’s schema.

Reviewed by Cursor Bugbot for commit fb7091e. Bugbot is set up for automated code reviews on this repo.