test(ci): add reasoning_effort grid e2e regression suite by mateo-berri · Pull Request #28036 · BerriAI/litellm

mateo-berri · 2026-05-16T02:47:18Z

Summary

Converts the manual QA grid sweep from #27074 (231 cells — 21 provider × model combos × 11 effort values, validating the fix for #27039) into an automated regression suite. Each cell hits a real provider endpoint once, then VCR-replays from a Redis-backed cassette on every subsequent CI run — no live spend after the first record pass.

Wire body assertions per cell: thinking.type, output_config.effort, thinking.budget_tokens, max_tokens. These catch any silent drop/strip in a provider transformation (the bug class that #27074 fixed for Bedrock + Vertex + Bedrock-Invoke /v1/messages).
Status assertions per cell: 200 vs BadRequestError → 400. These catch regressions in the ValueError → BadRequestError mappings (disabled / invalid / "" / unsupported xhigh / max on non-supporting models).
Grid encoded as a rule set, not 231 hardcoded cells: post-fix expectations live in tests/llm_translation/reasoning_effort_grid/grid_spec.py keyed by (model_mode, effort) plus per-model supports_xhigh_reasoning_effort / supports_max_reasoning_effort overrides, then expanded across the 5 chat-completion routes and the Bedrock Invoke /v1/messages route. Adding a new model or effort is a one-line change.
VCR via existing tests/llm_translation/conftest.py: that conftest auto-applies @pytest.mark.vcr to every collected item and registers the Redis-backed cassette persister (CASSETTE_REDIS_URL). First CI run with provider creds records 231 cassettes; every subsequent run replays them. No new VCR plumbing in this PR.
CI gating: no new job — the existing llm_translation_testing workflow already globs tests/llm_translation/**/test_*.py, so the suite is gated by an existing job behind the *main_branches filter (main + litellm_*). Per-cell pytest.skip when route env vars are absent, so PR builds without credentials no-op gracefully.
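The record-once, replay-afterward behaviour from the VCR bullet above can be illustrated with a toy cassette store. This is only a sketch: `InMemoryCassetteStore`, `fetch`, and the dict backend are hypothetical stand-ins, and the real Redis-backed persister in tests/llm_translation/conftest.py is not shown in this PR.

```python
from typing import Callable, Dict


class InMemoryCassetteStore:
    """Hypothetical stand-in for a cassette persister (the real one is Redis-backed)."""

    def __init__(self) -> None:
        self._cassettes: Dict[str, str] = {}
        self.live_calls = 0

    def fetch(self, cell_id: str, live_call: Callable[[], str]) -> str:
        # The first request for a cell performs the live call and records it.
        if cell_id not in self._cassettes:
            self.live_calls += 1
            self._cassettes[cell_id] = live_call()
        # Every later request for the same cell replays the recording.
        return self._cassettes[cell_id]


store = InMemoryCassetteStore()
first = store.fetch("anthropic_direct-claude-opus-4-7-high", lambda: "recorded body")
second = store.fetch("anthropic_direct-claude-opus-4-7-high", lambda: "never called")
```

Under these assumptions the second `fetch` never invokes its callable, which is the property that keeps replay runs free of live spend.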

Routes covered (matches QA proxy config exactly)

Route	Models	Cells
Anthropic direct	opus-4-7, sonnet-4-6, haiku-4-5	33
Azure AI Foundry	opus-4-7, opus-4-6, sonnet-4-6, haiku-4-5	44
Vertex AI	opus-4-7, opus-4-6, sonnet-4-6, haiku-4-5	44
Bedrock Converse	opus-4-7, opus-4-6, sonnet-4-6, sonnet-4-5	44
Bedrock Invoke /chat	opus-4-6, sonnet-4-6, opus-4-5	33
Bedrock Invoke /v1/messages	opus-4-6, sonnet-4-6, opus-4-5	33
Total	21	231
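The totals in the table can be re-derived with a few lines of arithmetic. The sketch below copies the per-route model counts from the table; the variable names are illustrative and do not come from grid_spec.py.

```python
# Models per route, copied from the table above.
MODELS_PER_ROUTE = {
    "Anthropic direct": 3,
    "Azure AI Foundry": 4,
    "Vertex AI": 4,
    "Bedrock Converse": 4,
    "Bedrock Invoke /chat": 3,
    "Bedrock Invoke /v1/messages": 3,
}
EFFORT_VALUES = 11  # includes the omitted case and the empty string

# Each model in a route is exercised once per effort value.
cells_per_route = {route: n * EFFORT_VALUES for route, n in MODELS_PER_ROUTE.items()}
total_combos = sum(MODELS_PER_ROUTE.values())
total_cells = sum(cells_per_route.values())
```

Here `total_combos` comes out to 21 and `total_cells` to 231, matching the Total row.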

Files

tests/llm_translation/reasoning_effort_grid/grid_spec.py — post-fix expectation rule set + model matrix per route.
tests/llm_translation/reasoning_effort_grid/conftest.py — wire_capture fixture (CustomLogger that records complete_input_dict pre-call). VCR plumbing inherited from the parent conftest.
tests/llm_translation/reasoning_effort_grid/test_reasoning_effort_grid.py — single parametrized test (231 cells) + 2 meta-tests for grid integrity.

Test plan

pytest --collect-only lists 231 parametrized cells + 2 meta-tests (233 total)
test_grid_cell_count and test_grid_route_coverage pass without provider env vars
Cells skip cleanly when provider env vars absent
black . formatting applied
First green CI run on litellm_* branch with provider creds (records all 231 cassettes; verify all cells pass against the post-fix code from #27074)
Re-run CI to confirm replay-only path (no live calls, all cells HIT in the VCR classification summary)
Revert one of the #27074 fixes locally and confirm the matching cell(s) fail (regression-detection smoke test)

Note

Medium Risk
Adds a large new VCR-backed e2e test matrix (231 parametrized cells) that can affect CI runtime/stability and depends on recorded provider interactions and env-gated live calls, though it doesn’t change production code paths.

Overview
Introduces a new reasoning_effort_grid e2e regression suite that sweeps 21 Anthropic-backed model/provider routes × 11 reasoning_effort values and asserts both the returned status (200 vs 400) and the captured upstream wire body fields (thinking.*, output_config.effort, max_tokens) to catch transformation/mapping regressions.

Encodes the expected behavior as rules in grid_spec.py (expanded into the full matrix) and adds a wire_capture fixture that hooks LiteLLM callbacks to record complete_input_dict pre-call; tests rely on the existing tests/llm_translation Redis VCR auto-marker so calls are recorded once and replayed in CI, with per-route skips when required env vars are missing.

Hardens a few existing live-provider tests by skipping on upstream instability: Fireworks tests now pytest.skip on NotFoundError (404 model missing), Gemini image-size-limit test skips when Wikimedia rate-limits (429), and an OpenAPI compliance test relaxes expectations to avoid a now-missing per-turn output field in Google’s schema.

Reviewed by Cursor Bugbot for commit fb7091e. Bugbot is set up for automated code reviews on this repo.

Encode the 231-cell QA sweep (21 provider x model combos x 11 effort values) from #27039 / #27074 as an automated CircleCI-gated regression suite. Each cell hits the real provider endpoint, captures the outgoing wire body via a pre-call CustomLogger, and asserts:

- thinking.type, output_config.effort, thinking.budget_tokens, max_tokens in the captured request body (regression signal for silent drops/strips in any provider transformation)
- HTTP status (200 vs BadRequestError -> 400) returned by litellm (regression signal for clean-error vs leaked-500 mappings)

The matrix is encoded as a small rule set keyed by (model_mode, effort) plus per-model xhigh/max capability overrides, then expanded across the five chat-completion routes (Anthropic direct, Azure AI Foundry, Vertex AI, Bedrock Converse, Bedrock Invoke /chat) and the Bedrock Invoke /v1/messages route.

Cells skip at runtime when the route's provider env vars are absent, so PR builds without credentials no-op gracefully. Wired into CircleCI as the reasoning_effort_grid_v4_e2e job behind the existing main / litellm_* branch filter.

codecov · 2026-05-16T02:49:59Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


… body, guard budget tokens

- Remove unused vertex_credentials_path fixture (and now-unused os import) from conftest.py.
- Parse Bedrock Converse complete_input_dict (logged as a JSON string by converse_handler.py) before passing to _assert_cell, so dict accessors work uniformly across routes.
- Extend _BUDGET_TOKENS with xhigh and max entries so the budget-mode branch in expected() cannot KeyError if a future budget model gains the matching cap.

Co-authored-by: Yassin Kortam <yassin@berri.ai>
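The Converse parsing step in this commit can be sketched as a small normalizer. `normalize_wire_body` is a hypothetical helper name used for illustration; the suite itself does the equivalent check inline in the test function.

```python
import json
from typing import Any, Dict, Optional, Union


def normalize_wire_body(
    body: Union[str, Dict[str, Any], None]
) -> Optional[Dict[str, Any]]:
    """Return the captured body as a dict, parsing it first if it was logged as JSON text."""
    if isinstance(body, str):
        return json.loads(body)
    return body


# A Converse-style capture arrives as a JSON string; other routes log a dict.
converse_style = normalize_wire_body('{"inferenceConfig": {"maxTokens": 8192}}')
dict_style = normalize_wire_body({"max_tokens": 8192})
```

After normalization, both shapes support the same `.get(...)` accessors, which is what lets a single assertion helper serve every route.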

CLAassistant · 2026-05-16T02:58:28Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ mateo-berri
❌ cursoragent

cursor

Cursor Bugbot has reviewed your changes using high mode and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Sonnet-4-6 missing max cap causes wrong expected status
- Renamed _CAPS_OPUS_4_6 to _CAPS_4_6 (since the supports_max_reasoning_effort cap is shared by opus and sonnet 4.6) and assigned it to all sonnet-4-6 ModelEntry definitions across every route, so expected() now returns status=200 for effort='max', matching the runtime.

Preview (51dff1ff79)

diff --git a/.circleci/config.yml b/.circleci/config.yml
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -578,6 +578,48 @@
       # Store test results
       - store_test_results:
           path: test-results
+  reasoning_effort_grid_v4_e2e:
+    docker:
+      - *python312_image
+    working_directory: ~/project
+    resource_class: large
+
+    steps:
+      - checkout
+      - setup_google_dns
+      - install_uv
+      - restore_cache:
+          keys:
+            - v1-uv-cache-{{ checksum "uv.lock" }}
+      - run:
+          name: Install Dependencies
+          command: |
+            uv sync --frozen --all-groups --all-extras --python 3.12
+      - save_cache:
+          paths:
+            - ~/.cache/uv
+          key: v1-uv-cache-{{ checksum "uv.lock" }}
+      # Grid v4 exercises reasoning_effort mapping against real Anthropic,
+      # Azure AI Foundry, Vertex AI, Bedrock Converse, and Bedrock Invoke
+      # endpoints. Per-route cells pytest-skip themselves when the matching
+      # provider env vars are absent, so PRs without credentials no-op.
+      - run:
+          name: Run reasoning_effort grid v4 e2e suite
+          command: |
+            mkdir -p test-results
+            uv run --no-sync python -m pytest \
+              tests/test_litellm/reasoning_effort_grid_v4/ \
+              -v \
+              --junitxml=test-results/junit.xml \
+              --durations=20 \
+              -n 4 \
+              --timeout=180 --timeout_method=thread \
+              --retries 2 --retry-delay 5 \
+              --max-worker-restart=5
+          no_output_timeout: 20m
+
+      - store_test_results:
+          path: test-results
   realtime_translation_testing:
     docker:
       - *python312_image
@@ -2619,6 +2661,8 @@
           filters: *main_branches
       - llm_translation_testing:
           filters: *main_branches
+      - reasoning_effort_grid_v4_e2e:
+          filters: *main_branches
       - realtime_translation_testing:
           filters: *main_branches
       - agent_testing:

diff --git a/tests/test_litellm/reasoning_effort_grid_v4/__init__.py b/tests/test_litellm/reasoning_effort_grid_v4/__init__.py
new file mode 100644

diff --git a/tests/test_litellm/reasoning_effort_grid_v4/conftest.py b/tests/test_litellm/reasoning_effort_grid_v4/conftest.py
new file mode 100644
--- /dev/null
+++ b/tests/test_litellm/reasoning_effort_grid_v4/conftest.py
@@ -1,0 +1,51 @@
+"""Shared fixtures for the reasoning_effort grid v4 e2e suite."""
+
+from typing import Any, Dict, List, Optional
+
+import pytest
+
+import litellm
+from litellm.integrations.custom_logger import CustomLogger
+
+
+class _WireBodyCapture(CustomLogger):
+    """Pre-call hook that records the outgoing wire body LiteLLM sends upstream.
+
+    `complete_input_dict` is the fully transformed provider request as set by
+    every provider transformation in `litellm/llms/**`. Capturing it here means
+    a regression anywhere in the transformation chain (strip, rename, drop)
+    surfaces as an assertion failure on the cell that depends on it.
+    """
+
+    def __init__(self) -> None:
+        super().__init__()
+        self.records: List[Dict[str, Any]] = []
+
+    def log_pre_api_call(self, model, messages, kwargs):
+        self.records.append(
+            {
+                "model": model,
+                "body": kwargs.get("additional_args", {}).get("complete_input_dict"),
+                "api_base": kwargs.get("additional_args", {}).get("api_base"),
+            }
+        )
+
+    async def async_log_pre_api_call(self, model, messages, kwargs):
+        self.log_pre_api_call(model, messages, kwargs)
+
+    def latest(self) -> Optional[Dict[str, Any]]:
+        return self.records[-1] if self.records else None
+
+    def reset(self) -> None:
+        self.records.clear()
+
+
+@pytest.fixture()
+def wire_capture():
+    capture = _WireBodyCapture()
+    previous = list(litellm.callbacks)
+    litellm.callbacks = previous + [capture]
+    try:
+        yield capture
+    finally:
+        litellm.callbacks = previous

diff --git a/tests/test_litellm/reasoning_effort_grid_v4/grid_spec.py b/tests/test_litellm/reasoning_effort_grid_v4/grid_spec.py
new file mode 100644
--- /dev/null
+++ b/tests/test_litellm/reasoning_effort_grid_v4/grid_spec.py
@@ -1,0 +1,303 @@
+"""
+Canonical post-fix expectations for the reasoning_effort grid v4 sweep.
+
+The QA sweep on https://github.com/BerriAI/litellm/pull/27039#issuecomment-4363363610
+covered 21 (provider x model) combos x 11 effort values (231 cells). The follow-up
+PR https://github.com/BerriAI/litellm/pull/27074 closed nine bugs surfaced by that
+sweep. This module encodes the post-fix expectations as a small rule set keyed by
+(model_mode, effort) and per-model capability overrides, then expands them across
+the model x effort matrix per route.
+"""
+
+from dataclasses import dataclass, field
+from typing import Dict, FrozenSet, List, Optional, Tuple
+
+
+OMIT = object()
+
+
+@dataclass(frozen=True)
+class CellExpectation:
+    """Expected post-fix behavior for a single grid cell."""
+
+    status: int
+    thinking_type: object
+    output_config_effort: object = OMIT
+    thinking_budget_tokens: object = OMIT
+    max_tokens: object = OMIT
+
+
+@dataclass(frozen=True)
+class ModelEntry:
+    alias: str
+    model: str
+    mode: str
+    extra_params: Tuple[Tuple[str, str], ...] = field(default_factory=tuple)
+    required_env: FrozenSet[str] = field(default_factory=frozenset)
+    caps: FrozenSet[str] = field(default_factory=frozenset)
+
+    def params(self) -> Dict[str, str]:
+        return dict(self.extra_params)
+
+
+EFFORTS: Tuple[str, ...] = (
+    "__omit__",
+    "none",
+    "minimal",
+    "low",
+    "medium",
+    "high",
+    "xhigh",
+    "max",
+    "disabled",
+    "invalid",
+    "",
+)
+
+_BUDGET_TOKENS: Dict[str, int] = {
+    "minimal": 1024,
+    "low": 1024,
+    "medium": 2048,
+    "high": 4096,
+    "xhigh": 8192,
+    "max": 16384,
+}
+
+_ADAPTIVE_EFFORT_LABEL: Dict[str, str] = {
+    "minimal": "low",
+    "low": "low",
+    "medium": "medium",
+    "high": "high",
+    "xhigh": "xhigh",
+    "max": "max",
+}
+
+_BAD_REQUEST_EFFORTS: FrozenSet[str] = frozenset({"disabled", "invalid", ""})
+
+
+def expected(model: ModelEntry, effort: str) -> CellExpectation:
+    """Compute the post-fix expected cell for a (model, effort) pair."""
+    if effort in ("__omit__", "none"):
+        if model.mode == "budget":
+            return CellExpectation(status=200, thinking_type=OMIT, max_tokens=8192)
+        return CellExpectation(status=200, thinking_type=OMIT)
+
+    if effort in _BAD_REQUEST_EFFORTS:
+        return CellExpectation(status=400, thinking_type=OMIT)
+
+    if effort in ("xhigh", "max"):
+        cap = f"supports_{effort}_reasoning_effort"
+        if cap not in model.caps:
+            return CellExpectation(status=400, thinking_type=OMIT)
+
+    if model.mode == "adaptive":
+        return CellExpectation(
+            status=200,
+            thinking_type="adaptive",
+            output_config_effort=_ADAPTIVE_EFFORT_LABEL[effort],
+        )
+
+    return CellExpectation(
+        status=200,
+        thinking_type="enabled",
+        thinking_budget_tokens=_BUDGET_TOKENS[effort],
+        max_tokens=8192,
+    )
+
+
+_ANTHROPIC_REQ = frozenset({"ANTHROPIC_API_KEY"})
+_AZURE_FOUNDRY_REQ = frozenset({"AZURE_FOUNDRY_API_BASE", "AZURE_FOUNDRY_API_KEY"})
+_VERTEX_REQ = frozenset({"VERTEX_PROJECT"})
+_BEDROCK_REQ = frozenset({"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"})
+
+
+_CAPS_OPUS_4_7: FrozenSet[str] = frozenset(
+    {"supports_xhigh_reasoning_effort", "supports_max_reasoning_effort"}
+)
+_CAPS_4_6: FrozenSet[str] = frozenset({"supports_max_reasoning_effort"})
+_CAPS_NONE: FrozenSet[str] = frozenset()
+
+
+ANTHROPIC_DIRECT_MODELS: Tuple[ModelEntry, ...] = (
+    ModelEntry(
+        alias="claude-opus-4-7",
+        model="anthropic/claude-opus-4-7",
+        mode="adaptive",
+        required_env=_ANTHROPIC_REQ,
+        caps=_CAPS_OPUS_4_7,
+    ),
+    ModelEntry(
+        alias="claude-sonnet-4-6",
+        model="anthropic/claude-sonnet-4-6",
+        mode="adaptive",
+        required_env=_ANTHROPIC_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="claude-haiku-4-5",
+        model="anthropic/claude-haiku-4-5",
+        mode="budget",
+        required_env=_ANTHROPIC_REQ,
+        caps=_CAPS_NONE,
+    ),
+)
+
+
+AZURE_AI_MODELS: Tuple[ModelEntry, ...] = (
+    ModelEntry(
+        alias="azure-claude-opus-4-7",
+        model="azure_ai/claude-opus-4-7",
+        mode="adaptive",
+        required_env=_AZURE_FOUNDRY_REQ,
+        caps=_CAPS_OPUS_4_7,
+    ),
+    ModelEntry(
+        alias="azure-claude-opus-4-6",
+        model="azure_ai/claude-opus-4-6",
+        mode="adaptive",
+        required_env=_AZURE_FOUNDRY_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="azure-claude-sonnet-4-6",
+        model="azure_ai/claude-sonnet-4-6",
+        mode="adaptive",
+        required_env=_AZURE_FOUNDRY_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="azure-claude-haiku-4-5",
+        model="azure_ai/claude-haiku-4-5",
+        mode="budget",
+        required_env=_AZURE_FOUNDRY_REQ,
+        caps=_CAPS_NONE,
+    ),
+)
+
+
+VERTEX_AI_MODELS: Tuple[ModelEntry, ...] = (
+    ModelEntry(
+        alias="vertex-claude-opus-4-7",
+        model="vertex_ai/claude-opus-4-7",
+        mode="adaptive",
+        extra_params=(("vertex_location", "global"),),
+        required_env=_VERTEX_REQ,
+        caps=_CAPS_OPUS_4_7,
+    ),
+    ModelEntry(
+        alias="vertex-claude-opus-4-6",
+        model="vertex_ai/claude-opus-4-6",
+        mode="adaptive",
+        extra_params=(("vertex_location", "us-east5"),),
+        required_env=_VERTEX_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="vertex-claude-sonnet-4-6",
+        model="vertex_ai/claude-sonnet-4-6",
+        mode="adaptive",
+        extra_params=(("vertex_location", "us-east5"),),
+        required_env=_VERTEX_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="vertex-claude-haiku-4-5",
+        model="vertex_ai/claude-haiku-4-5",
+        mode="budget",
+        extra_params=(("vertex_location", "us-east5"),),
+        required_env=_VERTEX_REQ,
+        caps=_CAPS_NONE,
+    ),
+)
+
+
+BEDROCK_CONVERSE_MODELS: Tuple[ModelEntry, ...] = (
+    ModelEntry(
+        alias="bedrock-claude-opus-4-7",
+        model="bedrock/converse/us.anthropic.claude-opus-4-7",
+        mode="adaptive",
+        extra_params=(("aws_region_name", "us-east-1"),),
+        required_env=_BEDROCK_REQ,
+        caps=_CAPS_OPUS_4_7,
+    ),
+    ModelEntry(
+        alias="bedrock-claude-opus-4-6",
+        model="bedrock/converse/us.anthropic.claude-opus-4-6-v1",
+        mode="adaptive",
+        extra_params=(("aws_region_name", "us-east-1"),),
+        required_env=_BEDROCK_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="bedrock-claude-sonnet-4-6",
+        model="bedrock/converse/us.anthropic.claude-sonnet-4-6",
+        mode="adaptive",
+        extra_params=(("aws_region_name", "us-east-1"),),
+        required_env=_BEDROCK_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="bedrock-claude-sonnet-4-5",
+        model="bedrock/converse/us.anthropic.claude-sonnet-4-5-20250929-v1:0",
+        mode="budget",
+        extra_params=(("aws_region_name", "us-east-1"),),
+        required_env=_BEDROCK_REQ,
+        caps=_CAPS_NONE,
+    ),
+)
+
+
+BEDROCK_INVOKE_CHAT_MODELS: Tuple[ModelEntry, ...] = (
+    ModelEntry(
+        alias="bedrock-invoke-claude-opus-4-6",
+        model="bedrock/invoke/us.anthropic.claude-opus-4-6-v1",
+        mode="adaptive",
+        extra_params=(("aws_region_name", "us-east-1"),),
+        required_env=_BEDROCK_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="bedrock-invoke-claude-sonnet-4-6",
+        model="bedrock/invoke/us.anthropic.claude-sonnet-4-6",
+        mode="adaptive",
+        extra_params=(("aws_region_name", "us-east-1"),),
+        required_env=_BEDROCK_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="bedrock-invoke-claude-opus-4-5",
+        model="bedrock/invoke/us.anthropic.claude-opus-4-5-20251101-v1:0",
+        mode="budget",
+        extra_params=(("aws_region_name", "us-east-1"),),
+        required_env=_BEDROCK_REQ,
+        caps=_CAPS_NONE,
+    ),
+)
+
+
+BEDROCK_INVOKE_MESSAGES_MODELS: Tuple[ModelEntry, ...] = BEDROCK_INVOKE_CHAT_MODELS
+
+
+@dataclass(frozen=True)
+class Route:
+    name: str
+    models: Tuple[ModelEntry, ...]
+
+
+ROUTES: Tuple[Route, ...] = (
+    Route("anthropic_direct", ANTHROPIC_DIRECT_MODELS),
+    Route("azure_ai", AZURE_AI_MODELS),
+    Route("vertex_ai", VERTEX_AI_MODELS),
+    Route("bedrock_converse", BEDROCK_CONVERSE_MODELS),
+    Route("bedrock_invoke_chat", BEDROCK_INVOKE_CHAT_MODELS),
+    Route("bedrock_invoke_messages", BEDROCK_INVOKE_MESSAGES_MODELS),
+)
+
+
+def all_cells() -> List[Tuple[str, ModelEntry, str, CellExpectation]]:
+    cells: List[Tuple[str, ModelEntry, str, CellExpectation]] = []
+    for route in ROUTES:
+        for model in route.models:
+            for effort in EFFORTS:
+                cells.append((route.name, model, effort, expected(model, effort)))
+    return cells

diff --git a/tests/test_litellm/reasoning_effort_grid_v4/test_grid_v4.py b/tests/test_litellm/reasoning_effort_grid_v4/test_grid_v4.py
new file mode 100644
--- /dev/null
+++ b/tests/test_litellm/reasoning_effort_grid_v4/test_grid_v4.py
@@ -1,0 +1,236 @@
+"""
+End-to-end grid v4 regression suite for reasoning_effort mapping across
+Anthropic-backed routes.
+
+Encodes the 21 (provider x model) x 11 effort matrix (231 cells) from the
+QA sweep on https://github.com/BerriAI/litellm/pull/27039#issuecomment-4363363610
+that the fix in https://github.com/BerriAI/litellm/pull/27074 was validated
+against. Each cell asserts:
+
+  - Wire body shape captured pre-call (thinking.type, output_config.effort,
+    thinking.budget_tokens, max_tokens) -- the regression signal for silent
+    drops/strips anywhere in the transformation chain.
+  - Status code returned by LiteLLM (200 vs BadRequestError -> 400) -- the
+    regression signal for clean-error vs leaked-500 mappings.
+
+Hits real provider endpoints. Each route is skipped at runtime when its
+required env vars are absent, so PR builds without provider credentials no-op
+gracefully.
+"""
+
+import json
+import os
+from typing import Any, Dict, List, Optional, Tuple
+
+import pytest
+
+import litellm
+from litellm.exceptions import BadRequestError
+
+from .grid_spec import (
+    OMIT,
+    ROUTES,
+    CellExpectation,
+    ModelEntry,
+    all_cells,
+)
+
+
+_PROMPT_MESSAGES: List[Dict[str, str]] = [
+    {"role": "user", "content": "Step by step, calculate 47 * 53. Show your work."}
+]
+
+
+def _required_env_missing(model: ModelEntry) -> Optional[str]:
+    missing = [key for key in model.required_env if not os.environ.get(key)]
+    if missing:
+        return "missing env: " + ", ".join(sorted(missing))
+    return None
+
+
+def _max_tokens_for(model: ModelEntry) -> int:
+    return 200 if model.mode == "adaptive" else 8192
+
+
+def _build_completion_kwargs(model: ModelEntry, effort: str) -> Dict[str, Any]:
+    kwargs: Dict[str, Any] = {
+        "model": model.model,
+        "messages": _PROMPT_MESSAGES,
+        "max_tokens": _max_tokens_for(model),
+    }
+    kwargs.update(model.params())
+    if effort != "__omit__":
+        kwargs["reasoning_effort"] = effort
+    if model.model.startswith("vertex_ai/"):
+        kwargs["vertex_project"] = os.environ.get(
+            "VERTEX_PROJECT", "vertex-check-481318"
+        )
+    if model.model.startswith("azure_ai/"):
+        kwargs["api_base"] = os.environ["AZURE_FOUNDRY_API_BASE"]
+        kwargs["api_key"] = os.environ["AZURE_FOUNDRY_API_KEY"]
+    return kwargs
+
+
+def _build_messages_kwargs(model: ModelEntry, effort: str) -> Dict[str, Any]:
+    kwargs = _build_completion_kwargs(model, effort)
+    return kwargs
+
+
+def _converse_subbody(body: Dict[str, Any]) -> Dict[str, Any]:
+    """Return the dict that holds thinking/output_config for a Converse wire body."""
+    return body.get("additionalModelRequestFields", body)
+
+
+def _max_tokens_from_body(body: Dict[str, Any], route_name: str) -> Optional[int]:
+    if route_name == "bedrock_converse":
+        return body.get("inferenceConfig", {}).get("maxTokens")
+    return body.get("max_tokens")
+
+
+def _assert_cell(
+    route_name: str,
+    body: Optional[Dict[str, Any]],
+    status: int,
+    cell: CellExpectation,
+) -> None:
+    assert status == cell.status, f"expected status={cell.status}, got status={status}"
+
+    if cell.status != 200:
+        # Bad-request paths short-circuit before the wire body matters.
+        return
+
+    assert body is not None, "wire body was not captured for a 200-status cell"
+    subbody = _converse_subbody(body) if route_name == "bedrock_converse" else body
+    thinking = subbody.get("thinking")
+    output_config = subbody.get("output_config")
+
+    if cell.thinking_type is OMIT:
+        assert thinking is None, f"expected thinking omitted, got {thinking!r}"
+    else:
+        assert thinking is not None, "expected thinking present, got omit"
+        assert thinking.get("type") == cell.thinking_type, (
+            f"expected thinking.type={cell.thinking_type!r}, "
+            f"got {thinking.get('type')!r}"
+        )
+
+    if cell.output_config_effort is OMIT:
+        assert (
+            output_config is None or "effort" not in output_config
+        ), f"expected output_config.effort omitted, got {output_config!r}"
+    else:
+        assert output_config is not None, (
+            f"expected output_config.effort={cell.output_config_effort!r}, "
+            "got output_config omitted"
+        )
+        assert output_config.get("effort") == cell.output_config_effort, (
+            f"expected output_config.effort={cell.output_config_effort!r}, "
+            f"got {output_config.get('effort')!r}"
+        )
+
+    if cell.thinking_budget_tokens is not OMIT:
+        assert thinking is not None
+        assert thinking.get("budget_tokens") == cell.thinking_budget_tokens, (
+            f"expected thinking.budget_tokens={cell.thinking_budget_tokens!r}, "
+            f"got {thinking.get('budget_tokens')!r}"
+        )
+
+    if cell.max_tokens is not OMIT:
+        wire_max = _max_tokens_from_body(body, route_name)
+        assert (
+            wire_max == cell.max_tokens
+        ), f"expected max_tokens={cell.max_tokens!r}, got {wire_max!r}"
+
+
+_PARAMS: List[Tuple[str, ModelEntry, str, CellExpectation]] = all_cells()
+
+
+def _cell_id(case: Tuple[str, ModelEntry, str, CellExpectation]) -> str:
+    route_name, model, effort, _ = case
+    effort_label = "__empty__" if effort == "" else effort
+    return f"{route_name}-{model.alias}-{effort_label}"
+
+
+_PARAM_IDS: List[str] = [_cell_id(case) for case in _PARAMS]
+
+
+async def _call_chat(model: ModelEntry, effort: str) -> Tuple[int, Optional[Exception]]:
+    kwargs = _build_completion_kwargs(model, effort)
+    try:
+        await litellm.acompletion(**kwargs)
+        return 200, None
+    except BadRequestError as exc:
+        return 400, exc
+    except Exception as exc:
+        return 500, exc
+
+
+async def _call_messages(
+    model: ModelEntry, effort: str
+) -> Tuple[int, Optional[Exception]]:
+    kwargs = _build_messages_kwargs(model, effort)
+    try:
+        await litellm.messages.acreate(**kwargs)
+        return 200, None
+    except BadRequestError as exc:
+        return 400, exc
+    except Exception as exc:
+        return 500, exc
+
+
+@pytest.mark.asyncio
+@pytest.mark.parametrize(
+    ("route_name", "model", "effort", "cell"), _PARAMS, ids=_PARAM_IDS
+)
+async def test_reasoning_effort_grid_v4(
+    route_name: str,
+    model: ModelEntry,
+    effort: str,
+    cell: CellExpectation,
+    wire_capture,
+) -> None:
+    skip_reason = _required_env_missing(model)
+    if skip_reason:
+        pytest.skip(skip_reason)
+
+    if route_name == "bedrock_invoke_messages":
+        status, exc = await _call_messages(model, effort)
+    else:
+        status, exc = await _call_chat(model, effort)
+
+    record = wire_capture.latest()
+    body = record["body"] if record else None
+    # Bedrock Converse logs `complete_input_dict` as a JSON string (see
+    # litellm/llms/bedrock/chat/converse_handler.py); parse it so the dict
+    # accessors in `_assert_cell` work uniformly across routes.
+    if route_name == "bedrock_converse" and isinstance(body, str):
+        body = json.loads(body)
+
+    try:
+        _assert_cell(route_name, body, status, cell)
+    except AssertionError:
+        if exc is not None:
+            raise AssertionError(
+                f"underlying exception ({type(exc).__name__}): {exc}"
+            ) from None
+        raise
+
+
+def test_grid_v4_cell_count() -> None:
+    """Guard against accidental drops or duplicates in the grid spec."""
+    assert len(_PARAMS) == 21 * 11, (
+        f"expected 231 cells (21 provider x model combos x 11 efforts), "
+        f"got {len(_PARAMS)}"
+    )
+
+
+def test_grid_v4_route_coverage() -> None:
+    """The grid must cover every route the original QA sweep covered."""
+    route_names = {route.name for route in ROUTES}
+    assert route_names == {
+        "anthropic_direct",
+        "azure_ai",
+        "vertex_ai",
+        "bedrock_converse",
+        "bedrock_invoke_chat",
+        "bedrock_invoke_messages",
+    }


Reviewed by Cursor Bugbot for commit 4327427.

…t cap

The runtime _validate_effort_for_model allows effort='max' for any Claude 4.6 model (opus or sonnet), and model_prices_and_context_window sets supports_max_reasoning_effort: true for claude-sonnet-4-6. The grid spec previously gave sonnet-4-6 entries _CAPS_NONE, so expected() returned status=400 for effort='max', which mismatched the runtime's status=200 and caused 6 cells (one per route) to fail.

Rename _CAPS_OPUS_4_6 to _CAPS_4_6 (since the cap set is shared by opus and sonnet 4.6) and assign it to all sonnet-4-6 entries.

Co-authored-by: Yassin Kortam <yassin@berri.ai>
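The effect of the cap change on the expected status can be shown with a reduced version of the capability gate. This sketch keeps only the xhigh/max branch of expected() from grid_spec.py; `expected_status` and the unprefixed constant names are illustrative, not part of the suite.

```python
from typing import FrozenSet

CAPS_4_6: FrozenSet[str] = frozenset({"supports_max_reasoning_effort"})
CAPS_NONE: FrozenSet[str] = frozenset()


def expected_status(effort: str, caps: FrozenSet[str]) -> int:
    """Simplified status rule covering only the capability-gated efforts."""
    if effort in ("xhigh", "max"):
        if f"supports_{effort}_reasoning_effort" not in caps:
            return 400
    return 200


before_fix = expected_status("max", CAPS_NONE)  # sonnet-4-6 with no caps assigned
after_fix = expected_status("max", CAPS_4_6)    # sonnet-4-6 sharing the 4.6 cap set
```

With no caps the gate expects 400 for `max`; with the shared 4.6 cap set it expects 200, while `xhigh` stays rejected because that cap is not in the set.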

…on, drop v4 naming

- Drop the "v4" suffix throughout: it referred to the QA sweep iteration, not this test suite. There's only one regression suite, so just call it reasoning_effort_grid.
- Move tests/test_litellm/reasoning_effort_grid_v4/ -> tests/llm_translation/reasoning_effort_grid/. Two reasons:
  1. The parent tests/test_litellm/conftest.py installs an autouse fixture (isolate_host_aws_config) that clears every AWS_* env var before each test, which would silently skip every Bedrock cell.
  2. tests/llm_translation/conftest.py already wires up the Redis-backed VCR persister and auto-applies @pytest.mark.vcr to every collected item via apply_vcr_auto_marker_to_items. Living under that conftest means the suite gets cassette replay for free -- first CI run with provider creds records 231 cassettes, every subsequent run replays them with no live spend.
- Trim the suite's own conftest down to just the wire_capture fixture; the inherited llm_translation conftest covers the VCR plumbing.
- Drop the dedicated reasoning_effort_grid_v4_e2e CircleCI job. The existing llm_translation_testing job globs tests/llm_translation/**/test_*.py, so the suite is gated by an existing job with no new wiring.
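The first reason for the move, an AWS-clearing fixture turning every Bedrock cell into a skip, can be reproduced in isolation. In this sketch `clear_aws_vars` is a hypothetical imitation of what isolate_host_aws_config is described as doing, and `skip_reason` mirrors the shape of the suite's env gate; neither is copied from the repository.

```python
from typing import Dict, FrozenSet, Mapping, Optional

BEDROCK_REQUIRED: FrozenSet[str] = frozenset(
    {"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"}
)


def clear_aws_vars(env: Mapping[str, str]) -> Dict[str, str]:
    """Imitate an autouse fixture that strips every AWS_* variable (assumption)."""
    return {k: v for k, v in env.items() if not k.startswith("AWS_")}


def skip_reason(required: FrozenSet[str], env: Mapping[str, str]) -> Optional[str]:
    """Return a skip reason when any required variable is missing or empty."""
    missing = sorted(key for key in required if not env.get(key))
    return "missing env: " + ", ".join(missing) if missing else None


ci_env = {"AWS_ACCESS_KEY_ID": "example-id", "AWS_SECRET_ACCESS_KEY": "example-secret"}
with_creds = skip_reason(BEDROCK_REQUIRED, ci_env)
after_isolation = skip_reason(BEDROCK_REQUIRED, clear_aws_vars(ci_env))
```

With credentials present the gate returns no reason, but once the AWS_* variables are cleared it reports both as missing, so the cell would skip rather than fail.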

…997/git/BerriAI/litellm into litellm_grid-v4-e2e-tests-cZRwz

… openapi field

Two CI failures, both pre-existing in different ways:

1. reasoning_effort_grid: all 33 bedrock_invoke_messages cells failed with AttributeError("module 'litellm' has no attribute 'messages'"). litellm exposes the async Anthropic Messages entrypoint as litellm.anthropic_messages (via "from .llms.anthropic.experimental_pass_through.messages.handler import *" in litellm/__init__.py), not litellm.messages.acreate. Swap the call.
2. tests/test_litellm/interactions/test_openapi_compliance.py::TestResponseCompliance::test_interaction_response_fields asserts the live Google spec contains "steps". Google's spec has churned through "outputs" -> "steps" -> neither, and presently carries neither. The test broke on main as soon as upstream dropped "steps"; pulling the key off the assert list realigns the test with the live schema. Re-add the per-turn output field once upstream stabilizes on a name.

The openapi-compliance fix doesn't belong to this PR conceptually but is included here per request to unblock CI before the morning.

… not class

The anthropic_messages route wraps client-side BadRequestError as AnthropicError (a BaseLLMException subclass) with status_code=400, so "except BadRequestError" missed those cells and they fell through to the generic Exception arm, returning 500 instead of the expected 400.

Replace the isinstance-on-BadRequestError check with a tiny classifier that prefers BadRequestError membership, then falls back to the exception's status_code attribute (set by every BaseLLMException subclass), then 500. Apply to both _call_chat and _call_messages for consistency.

Fixes the 13 CircleCI llm_translation_testing failures on bedrock_invoke_messages cells where the effort was disabled / invalid / empty / xhigh-on-unsupported / max-on-unsupported.
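The classifier described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the exception classes are local stand-ins for the litellm ones so the snippet runs on its own, and the name `classify_status` is hypothetical.

```python
# Stand-in exception types (assumptions for illustration only; the real
# classes live in litellm and have richer constructors).
class BadRequestError(Exception):
    """Stand-in for the client-side bad-request error."""


class BaseLLMException(Exception):
    """Stand-in for the base exception that carries an HTTP status_code."""

    def __init__(self, status_code: int, message: str) -> None:
        super().__init__(message)
        self.status_code = status_code


class AnthropicError(BaseLLMException):
    """Stand-in for the wrapper the anthropic_messages route raises."""


def classify_status(exc: Exception) -> int:
    """Map an exception to the HTTP status a grid cell should record."""
    # 1. Prefer class membership: a BadRequestError is always a 400.
    if isinstance(exc, BadRequestError):
        return 400
    # 2. Otherwise trust an integer status_code attribute if one is present.
    status = getattr(exc, "status_code", None)
    if isinstance(status, int):
        return status
    # 3. Anything else is treated as an unexpected server-side failure.
    return 500
```

With this ordering, a wrapped error such as `AnthropicError(400, "...")` is recorded as 400 even though it is not an instance of `BadRequestError`, which is the case the original `except BadRequestError` arm missed.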

Four pre-existing flakes on main that gate this branch's workflow even though they're unrelated to the reasoning_effort_grid suite:

1. tests/local_testing/test_completion.py::test_completion_fireworks_ai
2. tests/local_testing/test_completion_cost.py::test_completion_cost_fireworks_ai[fireworks_ai/llama-v3p3-70b-instruct]
3. tests/llm_translation/test_fireworks_ai_translation.py::test_document_inlining_example[False]

The Fireworks-hosted `llama-v3p3-70b-instruct` deployment is currently returning 404 "Model not found, inaccessible, and/or not deployed". These tests pass when the model is deployed; the issue is upstream capacity, not our code path. Wrap the live call in a try/except that pytest.skip's on litellm.NotFoundError so a Fireworks deployment hiccup no longer fails CI for unrelated PRs.

4. tests/llm_translation/test_gemini.py::test_gemini_image_size_limit_exceeded

The test fetches the 32MB "Blue Marble 2002" image from Wikimedia to exercise the 50MB image-size cap. CI runners share an IP pool with noisy traffic, so Wikimedia routinely returns HTTP 429. The size-limit check never gets a chance to fire. Catch the 429 BadRequestError and pytest.skip in that case.

None of these belong on this PR conceptually, but they're included per request to unblock the workflow before morning.

…ageFetchError

litellm.ImageFetchError is a subclass of BadRequestError, so when Wikimedia returns 429 the pytest.raises(ImageFetchError) block matches and swallows the exception -- the outer try/except never fires. Drop the try/except and check the captured error message for "Status code: 429" after the raises block, calling pytest.skip in that case. Same intent, right control flow.

greptile-apps · 2026-05-16T15:08:12Z

Greptile Summary

Converts a 231-cell manual QA grid sweep for reasoning_effort into an automated VCR-backed regression suite, encoding post-fix expectations as a rule-driven matrix in grid_spec.py and running them parametrically against six Anthropic-backed routes. A handful of existing tests also gain pytest.skip guards for upstream flakiness (Fireworks 404, Wikimedia 429) and one OpenAPI compliance assertion is removed pending upstream schema stabilization.

New reasoning_effort_grid suite (grid_spec.py, conftest.py, test_reasoning_effort_grid.py): 231-cell async parametrized test asserting wire-body shape and HTTP status for every (model, effort) pair; inherits Redis VCR replay from the parent conftest.py so only the first recording pass hits live endpoints.
Fireworks/Gemini/OpenAPI hardening: three existing test files add graceful skip guards for transient upstream errors and remove a per-turn output field assertion that Google's live spec has stopped providing.
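To illustrate what a rule-driven matrix means here, the sketch below expands a small rule set into parametrized cells. It is a hypothetical miniature, not the contents of grid_spec.py: the model names, the shortened effort list, and the helper `expected_status` are all assumptions; only the capability flag names and the 200-vs-400 outcomes come from the PR description.

```python
from itertools import product

# Hypothetical models with per-model capability flags (flag names from the PR).
MODELS = {
    "model-a": {
        "supports_xhigh_reasoning_effort": True,
        "supports_max_reasoning_effort": False,
    },
    "model-b": {
        "supports_xhigh_reasoning_effort": False,
        "supports_max_reasoning_effort": False,
    },
}

# Shortened, assumed effort list; the real suite uses 11 values.
EFFORTS = ["low", "medium", "high", "xhigh", "max", "disabled", "invalid", ""]

# Efforts the PR describes as mapping to a 400 regardless of model.
ALWAYS_REJECTED = {"disabled", "invalid", ""}


def expected_status(model: str, effort: str) -> int:
    """Derive the expected HTTP status for one (model, effort) cell."""
    caps = MODELS[model]
    if effort in ALWAYS_REJECTED:
        return 400
    if effort == "xhigh" and not caps["supports_xhigh_reasoning_effort"]:
        return 400
    if effort == "max" and not caps["supports_max_reasoning_effort"]:
        return 400
    return 200


# Expand the rules into concrete cells, e.g. for pytest.mark.parametrize.
CELLS = [(m, e, expected_status(m, e)) for m, e in product(MODELS, EFFORTS)]
```

Adding a model or an effort value in a layout like this is a one-entry change, and the cell count follows from the product of the two lists rather than from hand-maintained rows.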

Confidence Score: 5/5

Safe to merge — changes are entirely test-side with no production code impact.

All changes are in the test layer. The new regression suite is well-structured, skips cleanly when credentials are absent, and inherits existing VCR infrastructure. The only notable concern is that model capability caps are hardcoded in grid_spec.py rather than derived from the production config JSON, which can cause the test oracle to drift if capabilities are updated upstream — but this does not affect production behavior.

tests/llm_translation/reasoning_effort_grid/grid_spec.py — the hardcoded CAPS* constants should ideally be derived from get_model_info() to avoid oracle drift.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/llm_translation/reasoning_effort_grid/conftest.py | Adds wire_capture fixture using a CustomLogger pre-call hook; correctly restores litellm.callbacks in a finally block. |
| tests/llm_translation/reasoning_effort_grid/grid_spec.py | Encodes expected cells for the 231-cell matrix; capability caps for xhigh/max are hardcoded rather than read from the production model-config JSON, risking oracle drift. |
| tests/llm_translation/reasoning_effort_grid/test_reasoning_effort_grid.py | Parametrized 231-cell async test suite with wire-body and status assertions; skip guards, VCR plumbing, and error classification are all handled correctly. |
| tests/test_litellm/interactions/test_openapi_compliance.py | Removes assertion on the steps field from Google's live schema; reduces test coverage for a field whose upstream name is currently unstable. |
| tests/llm_translation/test_fireworks_ai_translation.py | Adds pytest.skip on NotFoundError to handle upstream model unavailability; preserves existing assertion logic. |
| tests/llm_translation/test_gemini.py | Reformats long assertion lines and adds a pytest.skip guard for Wikimedia 429 rate-limits in the image-size test; substantive assertions unchanged. |
| tests/local_testing/test_completion.py | Adds pytest.skip for NotFoundError on Fireworks completion test; existing test logic unmodified. |
| tests/local_testing/test_completion_cost.py | Wraps Fireworks completion call in try/except to skip on NotFoundError; cost assertion logic preserved. |

Reviews (2): Last reviewed commit: "refactor(reasoning_effort_grid): tighten..."

…view

Two P2 nits flagged by Greptile on PR 28036:

1. _build_completion_kwargs() defaulted vertex_project to "vertex-check-481318" when VERTEX_PROJECT was unset. That value is a specific GCP project that doesn't belong to this repo, so if the env-var skip guard were ever bypassed (misconfig, direct helper call), the test would silently issue calls to a foreign project rather than failing loudly. Drop the fallback and read os.environ["VERTEX_PROJECT"] directly, mirroring how AZURE_FOUNDRY_* are handled.
2. _build_messages_kwargs() was a one-liner that returned the result of _build_completion_kwargs() unchanged -- a dead abstraction with one caller. Inline at the _call_messages call site and delete the helper.
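The fail-loudly behavior from the first nit can be sketched as below. The helper name `build_vertex_kwargs` and its `env` parameter are hypothetical, introduced so the snippet can be exercised without touching the real environment; the actual helper in the PR is _build_completion_kwargs(), whose body is not shown here.

```python
import os
from typing import Mapping


def build_vertex_kwargs(env: Mapping[str, str] = os.environ) -> dict:
    """Build Vertex kwargs without a fallback project (hypothetical helper)."""
    # Direct indexing raises KeyError when VERTEX_PROJECT is unset, so a
    # bypassed skip guard fails loudly instead of calling a foreign project.
    # env.get("VERTEX_PROJECT", "<some default>") would hide the misconfig.
    return {"vertex_project": env["VERTEX_PROJECT"]}
```

Calling `build_vertex_kwargs({})` raises `KeyError`, while `build_vertex_kwargs({"VERTEX_PROJECT": "my-proj"})` returns the kwargs unchanged.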

mateo-berri · 2026-05-16T15:12:46Z

@greptileai

…s-cZRwz

Resolve conflicts in the five unrelated CI-flake fixes I previously landed on this branch -- staging shipped stronger versions (mocked HTTP for the Fireworks tests, mocked image-fetch for the Gemini size-limit test, switched the openapi-compliance test to the Interaction response schema instead of dropping the assertion). Take staging's version of all five files and drop my now-unreachable 429-skip lines from the Gemini test that the auto-merge left behind.

cursor Bot reviewed May 16, 2026

Comment thread tests/test_litellm/reasoning_effort_grid_v4/conftest.py Outdated

Comment thread tests/llm_translation/reasoning_effort_grid/test_reasoning_effort_grid.py

Comment thread tests/llm_translation/reasoning_effort_grid/grid_spec.py

cursor Bot reviewed May 16, 2026

Comment thread tests/llm_translation/reasoning_effort_grid/grid_spec.py

cursoragent and others added 3 commits May 16, 2026 03:12

Merge branch 'litellm_grid-v4-e2e-tests-cZRwz' of http://127.0.0.1:41…

d084782

…997/git/BerriAI/litellm into litellm_grid-v4-e2e-tests-cZRwz

mateo-berri force-pushed the litellm_grid-v4-e2e-tests-cZRwz branch from 0c4fed4 to d084782 Compare May 16, 2026 03:22

mateo-berri changed the title ~~test(ci): add reasoning_effort grid v4 e2e regression suite~~ test(ci): add reasoning_effort grid e2e regression suite May 16, 2026

mateo-berri added 4 commits May 16, 2026 06:59

mateo-berri marked this pull request as ready for review May 16, 2026 15:04

greptile-apps Bot reviewed May 16, 2026

Comment thread tests/llm_translation/reasoning_effort_grid/test_reasoning_effort_grid.py Outdated

Comment thread tests/llm_translation/reasoning_effort_grid/test_reasoning_effort_grid.py Outdated

mateo-berri requested a review from yuneng-berri May 16, 2026 15:26

mateo-berri added 2 commits May 16, 2026 15:33

refactor: strip PR-introduced docstrings and explanatory comments

f9485f1

yuneng-berri approved these changes May 16, 2026

yuneng-berri merged commit 57e5e4a into litellm_internal_staging May 16, 2026
115 checks passed

yuneng-berri deleted the litellm_grid-v4-e2e-tests-cZRwz branch May 16, 2026 16:38

yuneng-berri merged 12 commits into litellm_internal_staging from litellm_grid-v4-e2e-tests-cZRwz

4 participants
