
test(ci): add reasoning_effort grid e2e regression suite#28036

Merged
yuneng-berri merged 12 commits into litellm_internal_staging from litellm_grid-v4-e2e-tests-cZRwz
May 16, 2026

Conversation

@mateo-berri mateo-berri commented May 16, 2026

Collaborator

Summary

Converts the manual QA grid sweep from #27074 (231 cells — 21 provider × model combos × 11 effort values, validating the fix for #27039) into an automated regression suite. Each cell hits a real provider endpoint once, then VCR-replays from a Redis-backed cassette on every subsequent CI run — no live spend after the first record pass.

  • Wire body assertions per cell: thinking.type, output_config.effort, thinking.budget_tokens, max_tokens. These catch any silent drop/strip in a provider transformation, the bug class that #27074 ("fix(anthropic,bedrock,vertex): forward output_config.effort + 400 on garbage reasoning_effort") fixed for Bedrock, Vertex, and Bedrock Invoke /v1/messages.
  • Status assertions per cell: 200 vs BadRequestError → 400. These catch regressions in the ValueError → BadRequestError mappings (disabled / invalid / "" / unsupported xhigh / max on non-supporting models).
  • Grid encoded as a rule set, not 231 hardcoded cells: post-fix expectations live in tests/llm_translation/reasoning_effort_grid/grid_spec.py keyed by (model_mode, effort) plus per-model supports_xhigh_reasoning_effort / supports_max_reasoning_effort overrides, then expanded across the 5 chat-completion routes and the Bedrock Invoke /v1/messages route. Adding a new model or effort is a one-line change.
  • VCR via existing tests/llm_translation/conftest.py: that conftest auto-applies @pytest.mark.vcr to every collected item and registers the Redis-backed cassette persister (CASSETTE_REDIS_URL). First CI run with provider creds records 231 cassettes; every subsequent run replays them. No new VCR plumbing in this PR.
  • CI gating: no new job — the existing llm_translation_testing workflow already globs tests/llm_translation/**/test_*.py, so the suite is gated by an existing job behind the *main_branches filter (main + litellm_*). Per-cell pytest.skip when route env vars are absent, so PR builds without credentials no-op gracefully.
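
The record-once, replay-afterwards behavior described in the VCR bullet can be sketched with an in-memory store. This is only an illustration: `InMemoryCassetteStore` and its `fetch` method are hypothetical names, not the API of the VCR library or of the Redis-backed persister in tests/llm_translation/conftest.py.

```python
from typing import Callable, Dict


class InMemoryCassetteStore:
    """Hypothetical stand-in for a cassette persister (not the real Redis one)."""

    def __init__(self) -> None:
        self._cassettes: Dict[str, str] = {}
        self.live_calls = 0  # counts how often the "provider" was really hit

    def fetch(self, cell_id: str, live_call: Callable[[], str]) -> str:
        # Replay when a recording exists; otherwise call once and record.
        if cell_id not in self._cassettes:
            self.live_calls += 1
            self._cassettes[cell_id] = live_call()
        return self._cassettes[cell_id]


store = InMemoryCassetteStore()
cell = "anthropic_direct-claude-opus-4-7-high"  # id format used by the suite

first = store.fetch(cell, lambda: "recorded response")      # record pass
second = store.fetch(cell, lambda: "should not be called")  # replay pass
```

The second call never reaches the live callable, which is the property that keeps later CI runs free of provider spend.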

Routes covered (matches QA proxy config exactly)

| Route | Models | Cells |
|---|---|---|
| Anthropic direct | opus-4-7, sonnet-4-6, haiku-4-5 | 33 |
| Azure AI Foundry | opus-4-7, opus-4-6, sonnet-4-6, haiku-4-5 | 44 |
| Vertex AI | opus-4-7, opus-4-6, sonnet-4-6, haiku-4-5 | 44 |
| Bedrock Converse | opus-4-7, opus-4-6, sonnet-4-6, sonnet-4-5 | 44 |
| Bedrock Invoke /chat | opus-4-6, sonnet-4-6, opus-4-5 | 33 |
| Bedrock Invoke /v1/messages | opus-4-6, sonnet-4-6, opus-4-5 | 33 |
| Total | 21 | 231 |
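
Each cell count is the route's model count times the 11 effort values. A quick arithmetic check, with the per-route model counts copied from the table (the variable names are illustrative only):

```python
# Models per route, copied from the routes table.
MODELS_PER_ROUTE = {
    "Anthropic direct": 3,
    "Azure AI Foundry": 4,
    "Vertex AI": 4,
    "Bedrock Converse": 4,
    "Bedrock Invoke /chat": 3,
    "Bedrock Invoke /v1/messages": 3,
}
EFFORT_VALUES = 11  # number of reasoning_effort values in the sweep

cells_per_route = {
    route: count * EFFORT_VALUES for route, count in MODELS_PER_ROUTE.items()
}
total_models = sum(MODELS_PER_ROUTE.values())
total_cells = sum(cells_per_route.values())
```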

Files

  • tests/llm_translation/reasoning_effort_grid/grid_spec.py: post-fix expectation rule set + model matrix per route.
  • tests/llm_translation/reasoning_effort_grid/conftest.py: wire_capture fixture (CustomLogger that records complete_input_dict pre-call). VCR plumbing inherited from the parent conftest.
  • tests/llm_translation/reasoning_effort_grid/test_reasoning_effort_grid.py: single parametrized test (231 cells) + 2 meta-tests for grid integrity.

Test plan


Note

Medium Risk
Adds a large new VCR-backed e2e test matrix (231 parametrized cells) that can affect CI runtime/stability and depends on recorded provider interactions and env-gated live calls, though it doesn’t change production code paths.

Overview
Introduces a new reasoning_effort_grid e2e regression suite that sweeps 21 Anthropic-backed model/provider routes × 11 reasoning_effort values and asserts both the returned status (200 vs 400) and the captured upstream wire body fields (thinking.*, output_config.effort, max_tokens) to catch transformation/mapping regressions.

Encodes the expected behavior as rules in grid_spec.py (expanded into the full matrix) and adds a wire_capture fixture that hooks LiteLLM callbacks to record complete_input_dict pre-call; tests rely on the existing tests/llm_translation Redis VCR auto-marker so calls are recorded once and replayed in CI, with per-route skips when required env vars are missing.

Hardens a few existing live-provider tests by skipping on upstream instability: Fireworks tests now pytest.skip on NotFoundError (404 model missing), Gemini image-size-limit test skips when Wikimedia rate-limits (429), and an OpenAPI compliance test relaxes expectations to avoid a now-missing per-turn output field in Google’s schema.

Reviewed by Cursor Bugbot for commit fb7091e. Bugbot is set up for automated code reviews on this repo.

Encode the 231-cell QA sweep (21 provider x model combos x 11 effort
values) from #27039 / #27074 as an automated CircleCI-gated regression
suite. Each cell hits the real provider endpoint, captures the outgoing
wire body via a pre-call CustomLogger, and asserts:

- thinking.type, output_config.effort, thinking.budget_tokens, max_tokens
  in the captured request body (regression signal for silent drops/strips
  in any provider transformation)
- HTTP status (200 vs BadRequestError -> 400) returned by litellm
  (regression signal for clean-error vs leaked-500 mappings)

The matrix is encoded as a small rule set keyed by (model_mode, effort)
plus per-model xhigh/max capability overrides, then expanded across the
five chat-completion routes (Anthropic direct, Azure AI Foundry, Vertex
AI, Bedrock Converse, Bedrock Invoke /chat) and the Bedrock Invoke
/v1/messages route. Cells skip at runtime when the route's provider env
vars are absent, so PR builds without credentials no-op gracefully.

Wired into CircleCI as the reasoning_effort_grid_v4_e2e job behind the
existing main / litellm_* branch filter.
@codecov

codecov Bot commented May 16, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.


Comment thread tests/test_litellm/reasoning_effort_grid_v4/conftest.py Outdated
Comment thread tests/llm_translation/reasoning_effort_grid/grid_spec.py
… body, guard budget tokens

- Remove unused vertex_credentials_path fixture (and now-unused os import)
  from conftest.py.
- Parse Bedrock Converse complete_input_dict (logged as a JSON string by
  converse_handler.py) before passing to _assert_cell, so dict accessors
  work uniformly across routes.
- Extend _BUDGET_TOKENS with xhigh and max entries so the budget-mode
  branch in expected() cannot KeyError if a future budget model gains
  the matching cap.
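
The Converse parsing step from the second bullet can be sketched as a small normalizer. This is a minimal sketch: `normalize_wire_body` is a hypothetical helper name, and the sample payload is made up for illustration, using only the Converse field names that the suite reads (`inferenceConfig.maxTokens`, `additionalModelRequestFields`).

```python
import json
from typing import Any, Dict, Optional, Union


def normalize_wire_body(
    body: Union[str, Dict[str, Any], None],
) -> Optional[Dict[str, Any]]:
    """Return the captured body as a dict, decoding it if logged as a JSON string."""
    if isinstance(body, str):
        return json.loads(body)
    return body


# Illustrative payload only; a real Converse request carries more fields.
converse_logged = json.dumps(
    {
        "inferenceConfig": {"maxTokens": 8192},
        "additionalModelRequestFields": {
            "thinking": {"type": "enabled", "budget_tokens": 1024}
        },
    }
)

parsed = normalize_wire_body(converse_logged)
passthrough = normalize_wire_body({"max_tokens": 200})
```

Bodies that are already dicts pass through unchanged, so the same assertion code can run on every route.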

Co-authored-by: Yassin Kortam <yassin@berri.ai>
@CLAassistant

CLAassistant commented May 16, 2026


CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ mateo-berri
❌ cursoragent

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using high mode and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Sonnet-4-6 missing max cap causes wrong expected status
    • Renamed _CAPS_OPUS_4_6 to _CAPS_4_6 (since the supports_max_reasoning_effort cap is shared by opus and sonnet 4.6) and assigned it to all sonnet-4-6 ModelEntry definitions across every route, so expected() now returns status=200 for effort='max', matching the runtime.
Preview (51dff1ff79)
diff --git a/.circleci/config.yml b/.circleci/config.yml
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -578,6 +578,48 @@
       # Store test results
       - store_test_results:
           path: test-results
+  reasoning_effort_grid_v4_e2e:
+    docker:
+      - *python312_image
+    working_directory: ~/project
+    resource_class: large
+
+    steps:
+      - checkout
+      - setup_google_dns
+      - install_uv
+      - restore_cache:
+          keys:
+            - v1-uv-cache-{{ checksum "uv.lock" }}
+      - run:
+          name: Install Dependencies
+          command: |
+            uv sync --frozen --all-groups --all-extras --python 3.12
+      - save_cache:
+          paths:
+            - ~/.cache/uv
+          key: v1-uv-cache-{{ checksum "uv.lock" }}
+      # Grid v4 exercises reasoning_effort mapping against real Anthropic,
+      # Azure AI Foundry, Vertex AI, Bedrock Converse, and Bedrock Invoke
+      # endpoints. Per-route cells pytest-skip themselves when the matching
+      # provider env vars are absent, so PRs without credentials no-op.
+      - run:
+          name: Run reasoning_effort grid v4 e2e suite
+          command: |
+            mkdir -p test-results
+            uv run --no-sync python -m pytest \
+              tests/test_litellm/reasoning_effort_grid_v4/ \
+              -v \
+              --junitxml=test-results/junit.xml \
+              --durations=20 \
+              -n 4 \
+              --timeout=180 --timeout_method=thread \
+              --retries 2 --retry-delay 5 \
+              --max-worker-restart=5
+          no_output_timeout: 20m
+
+      - store_test_results:
+          path: test-results
   realtime_translation_testing:
     docker:
       - *python312_image
@@ -2619,6 +2661,8 @@
           filters: *main_branches
       - llm_translation_testing:
           filters: *main_branches
+      - reasoning_effort_grid_v4_e2e:
+          filters: *main_branches
       - realtime_translation_testing:
           filters: *main_branches
       - agent_testing:

diff --git a/tests/test_litellm/reasoning_effort_grid_v4/__init__.py b/tests/test_litellm/reasoning_effort_grid_v4/__init__.py
new file mode 100644

diff --git a/tests/test_litellm/reasoning_effort_grid_v4/conftest.py b/tests/test_litellm/reasoning_effort_grid_v4/conftest.py
new file mode 100644
--- /dev/null
+++ b/tests/test_litellm/reasoning_effort_grid_v4/conftest.py
@@ -1,0 +1,51 @@
+"""Shared fixtures for the reasoning_effort grid v4 e2e suite."""
+
+from typing import Any, Dict, List, Optional
+
+import pytest
+
+import litellm
+from litellm.integrations.custom_logger import CustomLogger
+
+
+class _WireBodyCapture(CustomLogger):
+    """Pre-call hook that records the outgoing wire body LiteLLM sends upstream.
+
+    `complete_input_dict` is the fully transformed provider request as set by
+    every provider transformation in `litellm/llms/**`. Capturing it here means
+    a regression anywhere in the transformation chain (strip, rename, drop)
+    surfaces as an assertion failure on the cell that depends on it.
+    """
+
+    def __init__(self) -> None:
+        super().__init__()
+        self.records: List[Dict[str, Any]] = []
+
+    def log_pre_api_call(self, model, messages, kwargs):
+        self.records.append(
+            {
+                "model": model,
+                "body": kwargs.get("additional_args", {}).get("complete_input_dict"),
+                "api_base": kwargs.get("additional_args", {}).get("api_base"),
+            }
+        )
+
+    async def async_log_pre_api_call(self, model, messages, kwargs):
+        self.log_pre_api_call(model, messages, kwargs)
+
+    def latest(self) -> Optional[Dict[str, Any]]:
+        return self.records[-1] if self.records else None
+
+    def reset(self) -> None:
+        self.records.clear()
+
+
+@pytest.fixture()
+def wire_capture():
+    capture = _WireBodyCapture()
+    previous = list(litellm.callbacks)
+    litellm.callbacks = previous + [capture]
+    try:
+        yield capture
+    finally:
+        litellm.callbacks = previous

diff --git a/tests/test_litellm/reasoning_effort_grid_v4/grid_spec.py b/tests/test_litellm/reasoning_effort_grid_v4/grid_spec.py
new file mode 100644
--- /dev/null
+++ b/tests/test_litellm/reasoning_effort_grid_v4/grid_spec.py
@@ -1,0 +1,303 @@
+"""
+Canonical post-fix expectations for the reasoning_effort grid v4 sweep.
+
+The QA sweep on https://github.com/BerriAI/litellm/pull/27039#issuecomment-4363363610
+covered 21 (provider x model) combos x 11 effort values (231 cells). The follow-up
+PR https://github.com/BerriAI/litellm/pull/27074 closed nine bugs surfaced by that
+sweep. This module encodes the post-fix expectations as a small rule set keyed by
+(model_mode, effort) and per-model capability overrides, then expands them across
+the model x effort matrix per route.
+"""
+
+from dataclasses import dataclass, field
+from typing import Dict, FrozenSet, List, Optional, Tuple
+
+
+OMIT = object()
+
+
+@dataclass(frozen=True)
+class CellExpectation:
+    """Expected post-fix behavior for a single grid cell."""
+
+    status: int
+    thinking_type: object
+    output_config_effort: object = OMIT
+    thinking_budget_tokens: object = OMIT
+    max_tokens: object = OMIT
+
+
+@dataclass(frozen=True)
+class ModelEntry:
+    alias: str
+    model: str
+    mode: str
+    extra_params: Tuple[Tuple[str, str], ...] = field(default_factory=tuple)
+    required_env: FrozenSet[str] = field(default_factory=frozenset)
+    caps: FrozenSet[str] = field(default_factory=frozenset)
+
+    def params(self) -> Dict[str, str]:
+        return dict(self.extra_params)
+
+
+EFFORTS: Tuple[str, ...] = (
+    "__omit__",
+    "none",
+    "minimal",
+    "low",
+    "medium",
+    "high",
+    "xhigh",
+    "max",
+    "disabled",
+    "invalid",
+    "",
+)
+
+_BUDGET_TOKENS: Dict[str, int] = {
+    "minimal": 1024,
+    "low": 1024,
+    "medium": 2048,
+    "high": 4096,
+    "xhigh": 8192,
+    "max": 16384,
+}
+
+_ADAPTIVE_EFFORT_LABEL: Dict[str, str] = {
+    "minimal": "low",
+    "low": "low",
+    "medium": "medium",
+    "high": "high",
+    "xhigh": "xhigh",
+    "max": "max",
+}
+
+_BAD_REQUEST_EFFORTS: FrozenSet[str] = frozenset({"disabled", "invalid", ""})
+
+
+def expected(model: ModelEntry, effort: str) -> CellExpectation:
+    """Compute the post-fix expected cell for a (model, effort) pair."""
+    if effort in ("__omit__", "none"):
+        if model.mode == "budget":
+            return CellExpectation(status=200, thinking_type=OMIT, max_tokens=8192)
+        return CellExpectation(status=200, thinking_type=OMIT)
+
+    if effort in _BAD_REQUEST_EFFORTS:
+        return CellExpectation(status=400, thinking_type=OMIT)
+
+    if effort in ("xhigh", "max"):
+        cap = f"supports_{effort}_reasoning_effort"
+        if cap not in model.caps:
+            return CellExpectation(status=400, thinking_type=OMIT)
+
+    if model.mode == "adaptive":
+        return CellExpectation(
+            status=200,
+            thinking_type="adaptive",
+            output_config_effort=_ADAPTIVE_EFFORT_LABEL[effort],
+        )
+
+    return CellExpectation(
+        status=200,
+        thinking_type="enabled",
+        thinking_budget_tokens=_BUDGET_TOKENS[effort],
+        max_tokens=8192,
+    )
+
+
+_ANTHROPIC_REQ = frozenset({"ANTHROPIC_API_KEY"})
+_AZURE_FOUNDRY_REQ = frozenset({"AZURE_FOUNDRY_API_BASE", "AZURE_FOUNDRY_API_KEY"})
+_VERTEX_REQ = frozenset({"VERTEX_PROJECT"})
+_BEDROCK_REQ = frozenset({"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"})
+
+
+_CAPS_OPUS_4_7: FrozenSet[str] = frozenset(
+    {"supports_xhigh_reasoning_effort", "supports_max_reasoning_effort"}
+)
+_CAPS_4_6: FrozenSet[str] = frozenset({"supports_max_reasoning_effort"})
+_CAPS_NONE: FrozenSet[str] = frozenset()
+
+
+ANTHROPIC_DIRECT_MODELS: Tuple[ModelEntry, ...] = (
+    ModelEntry(
+        alias="claude-opus-4-7",
+        model="anthropic/claude-opus-4-7",
+        mode="adaptive",
+        required_env=_ANTHROPIC_REQ,
+        caps=_CAPS_OPUS_4_7,
+    ),
+    ModelEntry(
+        alias="claude-sonnet-4-6",
+        model="anthropic/claude-sonnet-4-6",
+        mode="adaptive",
+        required_env=_ANTHROPIC_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="claude-haiku-4-5",
+        model="anthropic/claude-haiku-4-5",
+        mode="budget",
+        required_env=_ANTHROPIC_REQ,
+        caps=_CAPS_NONE,
+    ),
+)
+
+
+AZURE_AI_MODELS: Tuple[ModelEntry, ...] = (
+    ModelEntry(
+        alias="azure-claude-opus-4-7",
+        model="azure_ai/claude-opus-4-7",
+        mode="adaptive",
+        required_env=_AZURE_FOUNDRY_REQ,
+        caps=_CAPS_OPUS_4_7,
+    ),
+    ModelEntry(
+        alias="azure-claude-opus-4-6",
+        model="azure_ai/claude-opus-4-6",
+        mode="adaptive",
+        required_env=_AZURE_FOUNDRY_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="azure-claude-sonnet-4-6",
+        model="azure_ai/claude-sonnet-4-6",
+        mode="adaptive",
+        required_env=_AZURE_FOUNDRY_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="azure-claude-haiku-4-5",
+        model="azure_ai/claude-haiku-4-5",
+        mode="budget",
+        required_env=_AZURE_FOUNDRY_REQ,
+        caps=_CAPS_NONE,
+    ),
+)
+
+
+VERTEX_AI_MODELS: Tuple[ModelEntry, ...] = (
+    ModelEntry(
+        alias="vertex-claude-opus-4-7",
+        model="vertex_ai/claude-opus-4-7",
+        mode="adaptive",
+        extra_params=(("vertex_location", "global"),),
+        required_env=_VERTEX_REQ,
+        caps=_CAPS_OPUS_4_7,
+    ),
+    ModelEntry(
+        alias="vertex-claude-opus-4-6",
+        model="vertex_ai/claude-opus-4-6",
+        mode="adaptive",
+        extra_params=(("vertex_location", "us-east5"),),
+        required_env=_VERTEX_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="vertex-claude-sonnet-4-6",
+        model="vertex_ai/claude-sonnet-4-6",
+        mode="adaptive",
+        extra_params=(("vertex_location", "us-east5"),),
+        required_env=_VERTEX_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="vertex-claude-haiku-4-5",
+        model="vertex_ai/claude-haiku-4-5",
+        mode="budget",
+        extra_params=(("vertex_location", "us-east5"),),
+        required_env=_VERTEX_REQ,
+        caps=_CAPS_NONE,
+    ),
+)
+
+
+BEDROCK_CONVERSE_MODELS: Tuple[ModelEntry, ...] = (
+    ModelEntry(
+        alias="bedrock-claude-opus-4-7",
+        model="bedrock/converse/us.anthropic.claude-opus-4-7",
+        mode="adaptive",
+        extra_params=(("aws_region_name", "us-east-1"),),
+        required_env=_BEDROCK_REQ,
+        caps=_CAPS_OPUS_4_7,
+    ),
+    ModelEntry(
+        alias="bedrock-claude-opus-4-6",
+        model="bedrock/converse/us.anthropic.claude-opus-4-6-v1",
+        mode="adaptive",
+        extra_params=(("aws_region_name", "us-east-1"),),
+        required_env=_BEDROCK_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="bedrock-claude-sonnet-4-6",
+        model="bedrock/converse/us.anthropic.claude-sonnet-4-6",
+        mode="adaptive",
+        extra_params=(("aws_region_name", "us-east-1"),),
+        required_env=_BEDROCK_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="bedrock-claude-sonnet-4-5",
+        model="bedrock/converse/us.anthropic.claude-sonnet-4-5-20250929-v1:0",
+        mode="budget",
+        extra_params=(("aws_region_name", "us-east-1"),),
+        required_env=_BEDROCK_REQ,
+        caps=_CAPS_NONE,
+    ),
+)
+
+
+BEDROCK_INVOKE_CHAT_MODELS: Tuple[ModelEntry, ...] = (
+    ModelEntry(
+        alias="bedrock-invoke-claude-opus-4-6",
+        model="bedrock/invoke/us.anthropic.claude-opus-4-6-v1",
+        mode="adaptive",
+        extra_params=(("aws_region_name", "us-east-1"),),
+        required_env=_BEDROCK_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="bedrock-invoke-claude-sonnet-4-6",
+        model="bedrock/invoke/us.anthropic.claude-sonnet-4-6",
+        mode="adaptive",
+        extra_params=(("aws_region_name", "us-east-1"),),
+        required_env=_BEDROCK_REQ,
+        caps=_CAPS_4_6,
+    ),
+    ModelEntry(
+        alias="bedrock-invoke-claude-opus-4-5",
+        model="bedrock/invoke/us.anthropic.claude-opus-4-5-20251101-v1:0",
+        mode="budget",
+        extra_params=(("aws_region_name", "us-east-1"),),
+        required_env=_BEDROCK_REQ,
+        caps=_CAPS_NONE,
+    ),
+)
+
+
+BEDROCK_INVOKE_MESSAGES_MODELS: Tuple[ModelEntry, ...] = BEDROCK_INVOKE_CHAT_MODELS
+
+
+@dataclass(frozen=True)
+class Route:
+    name: str
+    models: Tuple[ModelEntry, ...]
+
+
+ROUTES: Tuple[Route, ...] = (
+    Route("anthropic_direct", ANTHROPIC_DIRECT_MODELS),
+    Route("azure_ai", AZURE_AI_MODELS),
+    Route("vertex_ai", VERTEX_AI_MODELS),
+    Route("bedrock_converse", BEDROCK_CONVERSE_MODELS),
+    Route("bedrock_invoke_chat", BEDROCK_INVOKE_CHAT_MODELS),
+    Route("bedrock_invoke_messages", BEDROCK_INVOKE_MESSAGES_MODELS),
+)
+
+
+def all_cells() -> List[Tuple[str, ModelEntry, str, CellExpectation]]:
+    cells: List[Tuple[str, ModelEntry, str, CellExpectation]] = []
+    for route in ROUTES:
+        for model in route.models:
+            for effort in EFFORTS:
+                cells.append((route.name, model, effort, expected(model, effort)))
+    return cells

diff --git a/tests/test_litellm/reasoning_effort_grid_v4/test_grid_v4.py b/tests/test_litellm/reasoning_effort_grid_v4/test_grid_v4.py
new file mode 100644
--- /dev/null
+++ b/tests/test_litellm/reasoning_effort_grid_v4/test_grid_v4.py
@@ -1,0 +1,236 @@
+"""
+End-to-end grid v4 regression suite for reasoning_effort mapping across
+Anthropic-backed routes.
+
+Encodes the 21 (provider x model) x 11 effort matrix (231 cells) from the
+QA sweep on https://github.com/BerriAI/litellm/pull/27039#issuecomment-4363363610
+that the fix in https://github.com/BerriAI/litellm/pull/27074 was validated
+against. Each cell asserts:
+
+  - Wire body shape captured pre-call (thinking.type, output_config.effort,
+    thinking.budget_tokens, max_tokens) -- the regression signal for silent
+    drops/strips anywhere in the transformation chain.
+  - Status code returned by LiteLLM (200 vs BadRequestError -> 400) -- the
+    regression signal for clean-error vs leaked-500 mappings.
+
+Hits real provider endpoints. Each route is skipped at runtime when its
+required env vars are absent, so PR builds without provider credentials no-op
+gracefully.
+"""
+
+import json
+import os
+from typing import Any, Dict, List, Optional, Tuple
+
+import pytest
+
+import litellm
+from litellm.exceptions import BadRequestError
+
+from .grid_spec import (
+    OMIT,
+    ROUTES,
+    CellExpectation,
+    ModelEntry,
+    all_cells,
+)
+
+
+_PROMPT_MESSAGES: List[Dict[str, str]] = [
+    {"role": "user", "content": "Step by step, calculate 47 * 53. Show your work."}
+]
+
+
+def _required_env_missing(model: ModelEntry) -> Optional[str]:
+    missing = [key for key in model.required_env if not os.environ.get(key)]
+    if missing:
+        return "missing env: " + ", ".join(sorted(missing))
+    return None
+
+
+def _max_tokens_for(model: ModelEntry) -> int:
+    return 200 if model.mode == "adaptive" else 8192
+
+
+def _build_completion_kwargs(model: ModelEntry, effort: str) -> Dict[str, Any]:
+    kwargs: Dict[str, Any] = {
+        "model": model.model,
+        "messages": _PROMPT_MESSAGES,
+        "max_tokens": _max_tokens_for(model),
+    }
+    kwargs.update(model.params())
+    if effort != "__omit__":
+        kwargs["reasoning_effort"] = effort
+    if model.model.startswith("vertex_ai/"):
+        kwargs["vertex_project"] = os.environ.get(
+            "VERTEX_PROJECT", "vertex-check-481318"
+        )
+    if model.model.startswith("azure_ai/"):
+        kwargs["api_base"] = os.environ["AZURE_FOUNDRY_API_BASE"]
+        kwargs["api_key"] = os.environ["AZURE_FOUNDRY_API_KEY"]
+    return kwargs
+
+
+def _build_messages_kwargs(model: ModelEntry, effort: str) -> Dict[str, Any]:
+    kwargs = _build_completion_kwargs(model, effort)
+    return kwargs
+
+
+def _converse_subbody(body: Dict[str, Any]) -> Dict[str, Any]:
+    """Return the dict that holds thinking/output_config for a Converse wire body."""
+    return body.get("additionalModelRequestFields", body)
+
+
+def _max_tokens_from_body(body: Dict[str, Any], route_name: str) -> Optional[int]:
+    if route_name == "bedrock_converse":
+        return body.get("inferenceConfig", {}).get("maxTokens")
+    return body.get("max_tokens")
+
+
+def _assert_cell(
+    route_name: str,
+    body: Optional[Dict[str, Any]],
+    status: int,
+    cell: CellExpectation,
+) -> None:
+    assert status == cell.status, f"expected status={cell.status}, got status={status}"
+
+    if cell.status != 200:
+        # Bad-request paths short-circuit before the wire body matters.
+        return
+
+    assert body is not None, "wire body was not captured for a 200-status cell"
+    subbody = _converse_subbody(body) if route_name == "bedrock_converse" else body
+    thinking = subbody.get("thinking")
+    output_config = subbody.get("output_config")
+
+    if cell.thinking_type is OMIT:
+        assert thinking is None, f"expected thinking omitted, got {thinking!r}"
+    else:
+        assert thinking is not None, "expected thinking present, got omit"
+        assert thinking.get("type") == cell.thinking_type, (
+            f"expected thinking.type={cell.thinking_type!r}, "
+            f"got {thinking.get('type')!r}"
+        )
+
+    if cell.output_config_effort is OMIT:
+        assert (
+            output_config is None or "effort" not in output_config
+        ), f"expected output_config.effort omitted, got {output_config!r}"
+    else:
+        assert output_config is not None, (
+            f"expected output_config.effort={cell.output_config_effort!r}, "
+            "got output_config omitted"
+        )
+        assert output_config.get("effort") == cell.output_config_effort, (
+            f"expected output_config.effort={cell.output_config_effort!r}, "
+            f"got {output_config.get('effort')!r}"
+        )
+
+    if cell.thinking_budget_tokens is not OMIT:
+        assert thinking is not None
+        assert thinking.get("budget_tokens") == cell.thinking_budget_tokens, (
+            f"expected thinking.budget_tokens={cell.thinking_budget_tokens!r}, "
+            f"got {thinking.get('budget_tokens')!r}"
+        )
+
+    if cell.max_tokens is not OMIT:
+        wire_max = _max_tokens_from_body(body, route_name)
+        assert (
+            wire_max == cell.max_tokens
+        ), f"expected max_tokens={cell.max_tokens!r}, got {wire_max!r}"
+
+
+_PARAMS: List[Tuple[str, ModelEntry, str, CellExpectation]] = all_cells()
+
+
+def _cell_id(case: Tuple[str, ModelEntry, str, CellExpectation]) -> str:
+    route_name, model, effort, _ = case
+    effort_label = "__empty__" if effort == "" else effort
+    return f"{route_name}-{model.alias}-{effort_label}"
+
+
+_PARAM_IDS: List[str] = [_cell_id(case) for case in _PARAMS]
+
+
+async def _call_chat(model: ModelEntry, effort: str) -> Tuple[int, Optional[Exception]]:
+    kwargs = _build_completion_kwargs(model, effort)
+    try:
+        await litellm.acompletion(**kwargs)
+        return 200, None
+    except BadRequestError as exc:
+        return 400, exc
+    except Exception as exc:
+        return 500, exc
+
+
+async def _call_messages(
+    model: ModelEntry, effort: str
+) -> Tuple[int, Optional[Exception]]:
+    kwargs = _build_messages_kwargs(model, effort)
+    try:
+        await litellm.messages.acreate(**kwargs)
+        return 200, None
+    except BadRequestError as exc:
+        return 400, exc
+    except Exception as exc:
+        return 500, exc
+
+
+@pytest.mark.asyncio
+@pytest.mark.parametrize(
+    ("route_name", "model", "effort", "cell"), _PARAMS, ids=_PARAM_IDS
+)
+async def test_reasoning_effort_grid_v4(
+    route_name: str,
+    model: ModelEntry,
+    effort: str,
+    cell: CellExpectation,
+    wire_capture,
+) -> None:
+    skip_reason = _required_env_missing(model)
+    if skip_reason:
+        pytest.skip(skip_reason)
+
+    if route_name == "bedrock_invoke_messages":
+        status, exc = await _call_messages(model, effort)
+    else:
+        status, exc = await _call_chat(model, effort)
+
+    record = wire_capture.latest()
+    body = record["body"] if record else None
+    # Bedrock Converse logs `complete_input_dict` as a JSON string (see
+    # litellm/llms/bedrock/chat/converse_handler.py); parse it so the dict
+    # accessors in `_assert_cell` work uniformly across routes.
+    if route_name == "bedrock_converse" and isinstance(body, str):
+        body = json.loads(body)
+
+    try:
+        _assert_cell(route_name, body, status, cell)
+    except AssertionError:
+        if exc is not None:
+            raise AssertionError(
+                f"underlying exception ({type(exc).__name__}): {exc}"
+            ) from None
+        raise
+
+
+def test_grid_v4_cell_count() -> None:
+    """Guard against accidental drops or duplicates in the grid spec."""
+    assert len(_PARAMS) == 21 * 11, (
+        f"expected 231 cells (21 provider x model combos x 11 efforts), "
+        f"got {len(_PARAMS)}"
+    )
+
+
+def test_grid_v4_route_coverage() -> None:
+    """The grid must cover every route the original QA sweep covered."""
+    route_names = {route.name for route in ROUTES}
+    assert route_names == {
+        "anthropic_direct",
+        "azure_ai",
+        "vertex_ai",
+        "bedrock_converse",
+        "bedrock_invoke_chat",
+        "bedrock_invoke_messages",
+    }

Reviewed by Cursor Bugbot for commit 4327427.

Comment thread tests/llm_translation/reasoning_effort_grid/grid_spec.py
cursoragent and others added 3 commits May 16, 2026 03:12
…t cap

The runtime _validate_effort_for_model allows effort='max' for any
Claude 4.6 model (opus or sonnet), and model_prices_and_context_window
sets supports_max_reasoning_effort: true for claude-sonnet-4-6. The
grid spec previously gave sonnet-4-6 entries _CAPS_NONE, so expected()
returned status=400 for effort='max', which mismatched the runtime's
status=200 and caused 6 cells (one per route) to fail.

Rename _CAPS_OPUS_4_6 to _CAPS_4_6 (since the cap set is shared by
opus and sonnet 4.6) and assign it to all sonnet-4-6 entries.
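
The effect of the cap change on the expected status can be sketched in isolation. This is a simplified stand-in for the xhigh/max branch of expected() only; `expected_status` is a hypothetical name and the rest of the rule set is left out.

```python
from typing import FrozenSet

CAPS_NONE: FrozenSet[str] = frozenset()
CAPS_4_6: FrozenSet[str] = frozenset({"supports_max_reasoning_effort"})


def expected_status(caps: FrozenSet[str], effort: str) -> int:
    """Simplified xhigh/max gate: 400 unless the model declares the matching cap."""
    if effort in ("xhigh", "max"):
        if f"supports_{effort}_reasoning_effort" not in caps:
            return 400
    return 200


before_fix = expected_status(CAPS_NONE, "max")      # sonnet-4-6 with no caps
after_fix = expected_status(CAPS_4_6, "max")        # sonnet-4-6 with the shared cap
still_rejected = expected_status(CAPS_4_6, "xhigh")  # xhigh cap is not in the set
```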

Co-authored-by: Yassin Kortam <yassin@berri.ai>
…on, drop v4 naming

- Drop the "v4" suffix throughout: it referred to the QA sweep iteration,
  not this test suite. There's only one regression suite, so just call it
  reasoning_effort_grid.
- Move tests/test_litellm/reasoning_effort_grid_v4/ -> tests/llm_translation/
  reasoning_effort_grid/. Two reasons:
    1. The parent tests/test_litellm/conftest.py installs an autouse fixture
       (isolate_host_aws_config) that clears every AWS_* env var before each
       test, which would silently skip every Bedrock cell.
    2. tests/llm_translation/conftest.py already wires up the Redis-backed
       VCR persister and auto-applies @pytest.mark.vcr to every collected
       item via apply_vcr_auto_marker_to_items. Living under that conftest
       means the suite gets cassette replay for free -- first CI run with
       provider creds records 231 cassettes, every subsequent run replays
       them with no live spend.
- Trim the suite's own conftest down to just the wire_capture fixture; the
  inherited llm_translation conftest covers the VCR plumbing.
- Drop the dedicated reasoning_effort_grid_v4_e2e CircleCI job. The existing
  llm_translation_testing job globs tests/llm_translation/**/test_*.py, so
  the suite is gated by an existing job with no new wiring.
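
Why an AWS_*-clearing fixture would silently skip the Bedrock cells can be shown with plain dicts. This sketch models the environment as a dict rather than touching os.environ; `clear_aws_env` and `skip_reason` are hypothetical names that mirror, but are not, the isolate_host_aws_config fixture and the suite's env check.

```python
from typing import Dict, FrozenSet, Optional

BEDROCK_REQUIRED_ENV: FrozenSet[str] = frozenset(
    {"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"}
)


def clear_aws_env(env: Dict[str, str]) -> Dict[str, str]:
    """Model of a fixture that drops every AWS_* variable before a test."""
    return {key: value for key, value in env.items() if not key.startswith("AWS_")}


def skip_reason(env: Dict[str, str], required: FrozenSet[str]) -> Optional[str]:
    """Return a skip message when any required variable is missing or empty."""
    missing = sorted(key for key in required if not env.get(key))
    return "missing env: " + ", ".join(missing) if missing else None


# Placeholder values; real credentials would come from CI secrets.
ci_env = {
    "AWS_ACCESS_KEY_ID": "placeholder",
    "AWS_SECRET_ACCESS_KEY": "placeholder",
    "ANTHROPIC_API_KEY": "placeholder",
}

with_creds = skip_reason(ci_env, BEDROCK_REQUIRED_ENV)
after_clearing = skip_reason(clear_aws_env(ci_env), BEDROCK_REQUIRED_ENV)
```

With the credentials present the cell runs; after the clearing step the same cell reports a skip instead of a failure, which is what makes the problem easy to miss.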
@mateo-berri mateo-berri force-pushed the litellm_grid-v4-e2e-tests-cZRwz branch from 0c4fed4 to d084782 May 16, 2026
@mateo-berri mateo-berri changed the title from "test(ci): add reasoning_effort grid v4 e2e regression suite" to "test(ci): add reasoning_effort grid e2e regression suite" May 16, 2026
… openapi field

Two CI failures, both pre-existing in different ways:

1. reasoning_effort_grid: all 33 bedrock_invoke_messages cells failed with
   AttributeError("module 'litellm' has no attribute 'messages'"). litellm
   exposes the async Anthropic Messages entrypoint as litellm.anthropic_messages
   (via "from .llms.anthropic.experimental_pass_through.messages.handler
   import *" in litellm/__init__.py), not litellm.messages.acreate. Swap
   the call.

2. tests/test_litellm/interactions/test_openapi_compliance.py::TestResponseCompliance::test_interaction_response_fields
   asserts the live Google spec contains "steps". Google's spec has churned
   from "outputs" to "steps" and presently carries neither name. The test
   broke on main as soon as upstream dropped "steps"; pulling the key off
   the assert list realigns the test with the live schema. Re-add the
   per-turn output field once upstream stabilizes on a name.

The openapi-compliance fix doesn't belong to this PR conceptually but is
included here per request to unblock CI before the morning.
… not class

The anthropic_messages route wraps client-side BadRequestError as
AnthropicError (a BaseLLMException subclass) with status_code=400, so
"except BadRequestError" missed those cells and they fell through to the
generic Exception arm, returning 500 instead of the expected 400.

Replace the isinstance-on-BadRequestError check with a tiny classifier
that prefers BadRequestError membership, then falls back to the exception's
status_code attribute (set by every BaseLLMException subclass), then 500.
Apply to both _call_chat and _call_messages for consistency.
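
The fallback order described above can be sketched as follows. This is an illustration under stated assumptions, not the suite's actual helper: the three exception classes are local stand-ins for the litellm types of the same name, and `classify_status` is a hypothetical name.

```python
class BaseLLMException(Exception):
    # Stand-in for litellm's base class; assumed to carry status_code.
    def __init__(self, status_code: int, message: str):
        super().__init__(message)
        self.status_code = status_code


class BadRequestError(Exception):
    """Stand-in for litellm.BadRequestError."""


class AnthropicError(BaseLLMException):
    """Stand-in for the wrapper raised by the anthropic_messages route."""


def classify_status(exc: Exception) -> int:
    """Map an exception to the HTTP status a grid cell should record."""
    # 1. Prefer class membership: a BadRequestError is always a 400.
    if isinstance(exc, BadRequestError):
        return 400
    # 2. Otherwise trust an integer status_code attribute if one is set.
    status = getattr(exc, "status_code", None)
    if isinstance(status, int):
        return status
    # 3. Anything else is treated as an unexpected server-side failure.
    return 500
```

Checking the attribute rather than the class is what lets an `AnthropicError` with status_code=400 land in the 400 bucket instead of the generic 500 arm.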

Fixes the 13 CircleCI llm_translation_testing failures on
bedrock_invoke_messages cells where the effort was disabled / invalid /
empty / xhigh-on-unsupported / max-on-unsupported.
Four pre-existing flakes on main that gate this branch's workflow even
though they're unrelated to the reasoning_effort_grid suite:

1. tests/local_testing/test_completion.py::test_completion_fireworks_ai
2. tests/local_testing/test_completion_cost.py::test_completion_cost_fireworks_ai[fireworks_ai/llama-v3p3-70b-instruct]
3. tests/llm_translation/test_fireworks_ai_translation.py::test_document_inlining_example[False]

   The Fireworks-hosted `llama-v3p3-70b-instruct` deployment is currently
   returning 404 "Model not found, inaccessible, and/or not deployed".
   These tests pass when the model is deployed; the issue is upstream
   capacity, not our code path. Wrap the live call in a try/except that
   calls pytest.skip on litellm.NotFoundError so a Fireworks deployment
   hiccup no longer fails CI for unrelated PRs.

4. tests/llm_translation/test_gemini.py::test_gemini_image_size_limit_exceeded

   The test fetches the 32MB "Blue Marble 2002" image from Wikimedia to
   exercise the 50MB image-size cap. CI runners share an IP pool with
   noisy traffic, so Wikimedia routinely returns HTTP 429. The size-limit
   check never gets a chance to fire. Catch the 429 BadRequestError and
   pytest.skip in that case.

None of these belong on this PR conceptually, but they're included per
request to unblock the workflow before morning.
…ageFetchError

litellm.ImageFetchError is a subclass of BadRequestError, so when
Wikimedia returns 429 the pytest.raises(ImageFetchError) block matches
and swallows the exception -- the outer try/except never fires. Drop the
try/except and check the captured error message for "Status code: 429"
after the raises block, calling pytest.skip in that case. Same intent,
right control flow.
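
The control-flow problem can be reproduced without pytest. In the sketch below, `Raises` is a tiny stand-in for pytest.raises, the two exception classes are stand-ins for the litellm types, and `fetch_image` and `run_check` are hypothetical helpers; the error message format is assumed only as far as the "Status code: 429" substring mentioned above.

```python
class BadRequestError(Exception):
    """Stand-in for litellm.BadRequestError."""


class ImageFetchError(BadRequestError):
    """Stand-in for litellm.ImageFetchError."""


class Raises:
    """Minimal stand-in for pytest.raises: captures and swallows the error."""

    def __init__(self, expected):
        self.expected = expected
        self.value = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            raise AssertionError("expected exception was not raised")
        if issubclass(exc_type, self.expected):
            self.value = exc
            return True  # suppress the exception, as pytest.raises does
        return False


def fetch_image(status_code: int):
    # Hypothetical fetch that always fails with the status in its message.
    raise ImageFetchError(f"Unable to fetch image. Status code: {status_code}")


def run_check(status_code: int) -> str:
    outer_fired = False
    try:
        with Raises(ImageFetchError) as excinfo:
            fetch_image(status_code)
    except BadRequestError:
        outer_fired = True  # never reached: the raises block swallowed it
    assert not outer_fired
    # Inspect the captured error after the block instead.
    if "Status code: 429" in str(excinfo.value):
        return "skip"
    return "assert-size-limit"
```

In the real test, the "skip" branch would correspond to calling pytest.skip and the other branch to the existing size-limit assertions.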
@mateo-berri mateo-berri marked this pull request as ready for review May 16, 2026 15:04
@greptile-apps

greptile-apps Bot commented May 16, 2026

Contributor

Greptile Summary

Converts a 231-cell manual QA grid sweep for reasoning_effort into an automated VCR-backed regression suite, encoding post-fix expectations as a rule-driven matrix in grid_spec.py and running them parametrically against six Anthropic-backed routes. A handful of existing tests also gain pytest.skip guards for upstream flakiness (Fireworks 404, Wikimedia 429) and one OpenAPI compliance assertion is removed pending upstream schema stabilization.

  • New reasoning_effort_grid suite (grid_spec.py, conftest.py, test_reasoning_effort_grid.py): 231-cell async parametrized test asserting wire-body shape and HTTP status for every (model, effort) pair; inherits Redis VCR replay from the parent conftest.py so only the first recording pass hits live endpoints.
  • Fireworks/Gemini/OpenAPI hardening: three existing test files add graceful skip guards for transient upstream errors and remove a per-turn output field assertion that Google's live spec has stopped providing.

Confidence Score: 5/5

Safe to merge — changes are entirely test-side with no production code impact.

All changes are in the test layer. The new regression suite is well-structured, skips cleanly when credentials are absent, and inherits existing VCR infrastructure. The only notable concern is that model capability caps are hardcoded in grid_spec.py rather than derived from the production config JSON, which can cause the test oracle to drift if capabilities are updated upstream — but this does not affect production behavior.

tests/llm_translation/reasoning_effort_grid/grid_spec.py — the hardcoded _CAPS_* constants should ideally be derived from get_model_info() to avoid oracle drift.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/llm_translation/reasoning_effort_grid/conftest.py | Adds wire_capture fixture using a CustomLogger pre-call hook; correctly restores litellm.callbacks in a finally block. |
| tests/llm_translation/reasoning_effort_grid/grid_spec.py | Encodes expected cells for the 231-cell matrix; capability caps for xhigh/max are hardcoded rather than read from the production model-config JSON, risking oracle drift. |
| tests/llm_translation/reasoning_effort_grid/test_reasoning_effort_grid.py | Parametrized 231-cell async test suite with wire-body and status assertions; skip guards, VCR plumbing, and error classification are all handled correctly. |
| tests/test_litellm/interactions/test_openapi_compliance.py | Removes assertion on the steps field from Google's live schema; reduces test coverage for a field whose upstream name is currently unstable. |
| tests/llm_translation/test_fireworks_ai_translation.py | Adds pytest.skip on NotFoundError to handle upstream model unavailability; preserves existing assertion logic. |
| tests/llm_translation/test_gemini.py | Reformats long assertion lines and adds a pytest.skip guard for Wikimedia 429 rate-limits in the image-size test; substantive assertions unchanged. |
| tests/local_testing/test_completion.py | Adds pytest.skip for NotFoundError on Fireworks completion test; existing test logic unmodified. |
| tests/local_testing/test_completion_cost.py | Wraps Fireworks completion call in try/except to skip on NotFoundError; cost assertion logic preserved. |

Reviews (2): Last reviewed commit: "refactor(reasoning_effort_grid): tighten..."

Comment thread tests/llm_translation/reasoning_effort_grid/test_reasoning_effort_grid.py Outdated
Comment thread tests/llm_translation/reasoning_effort_grid/test_reasoning_effort_grid.py Outdated
…view

Two P2 nits flagged by Greptile on PR 28036:

1. _build_completion_kwargs() defaulted vertex_project to "vertex-check-481318"
   when VERTEX_PROJECT was unset. That value is a specific GCP project that
   doesn't belong to this repo, so if the env-var skip guard were ever
   bypassed (misconfig, direct helper call), the test would silently issue
   calls to a foreign project rather than failing loudly. Drop the fallback
   and read os.environ["VERTEX_PROJECT"] directly, mirroring how
   AZURE_FOUNDRY_* are handled.

2. _build_messages_kwargs() was a one-liner that returned the result of
   _build_completion_kwargs() unchanged -- a dead abstraction with one
   caller. Inline at the _call_messages call site and delete the helper.
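
A small sketch of the difference in failure mode for the first nit. Both function names are hypothetical and the environment is passed in as a mapping so the behaviour is easy to exercise; the suite itself reads os.environ["VERTEX_PROJECT"] directly, as described above.

```python
import os
from typing import Mapping


def vertex_project_with_fallback(env: Mapping[str, str]) -> str:
    # Old behaviour (sketch): a missing variable silently resolves to a
    # hardcoded project, so calls could go to a foreign GCP project.
    return env.get("VERTEX_PROJECT", "vertex-check-481318")


def vertex_project_strict(env: Mapping[str, str] = os.environ) -> str:
    # New behaviour (sketch): a missing variable raises KeyError, so a
    # bypassed skip guard fails loudly instead of issuing a request.
    return env["VERTEX_PROJECT"]
```

The strict lookup makes misconfiguration visible at the point where the kwargs are built rather than after a request has already left the runner.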
@mateo-berri

Collaborator Author

@greptileai

@mateo-berri mateo-berri requested a review from yuneng-berri May 16, 2026 15:26
…s-cZRwz

Resolve conflicts in the five unrelated CI-flake fixes I previously landed
on this branch -- staging shipped stronger versions (mocked HTTP for the
Fireworks tests, mocked image-fetch for the Gemini size-limit test, switched
the openapi-compliance test to the Interaction response schema instead of
dropping the assertion). Take staging's version of all five files and drop
my now-unreachable 429-skip lines from the Gemini test that the auto-merge
left behind.
@yuneng-berri yuneng-berri merged commit 57e5e4a into litellm_internal_staging May 16, 2026
115 checks passed
@yuneng-berri yuneng-berri deleted the litellm_grid-v4-e2e-tests-cZRwz branch May 16, 2026 16:38