test(reasoning_effort): add cross-provider wire-translation matrix by yuneng-berri · Pull Request #28046 · BerriAI/litellm

yuneng-berri · 2026-05-16T07:45:38Z

Relevant issues

Follow-up regression coverage for the reasoning_effort mapping bug class behind #27039 (reasoning_effort="none" NoneType crash) and #27074 (output_config.effort strips + ValueError/500 on garbage effort). Automated successor to the manual 231-cell QA sweep on #27039.

What this PR does

Adds a parametrized, fully-offline matrix under tests/llm_translation/reasoning_effort_matrix/ covering 5 routes × Claude models × 11 effort values × 2 entrypoints (/chat/completions and /v1/messages).

Two always-on offline oracles (the per-PR regression net, no network/creds):

Wire request — asserts the reasoning subtree (thinking + output_config) LiteLLM builds matches a hand-authored (tier × effort) spec, by value AND presence/absence (catches silent strips). Client-rejected efforts must raise a clean 400-class error (BadRequestError / AnthropicError@400), never a bare ValueError/500.
Staleness canary — fails loudly when model_prices grows a reasoning-capable Claude family the matrix does not cover, so the per-model recurrence (fix(anthropic,bedrock): omit thinking/output_config when reasoning_effort="none" #27039 → fix(anthropic,bedrock,vertex): forward output_config.effort + 400 on garbage reasoning_effort #27074) can't slip through silently.

One CI-gated oracle (dormant off-CI, no new CI config — runs in the existing llm_translation job via Redis-VCR):

Provider acceptance — replays/records the real Anthropic round-trip and asserts Anthropic accepts the request LiteLLM built. Skipped locally; activates on CASSETTE_REDIS_URL.

Coverage is per-provider idiomatic: map_openai_params + transform_request for Anthropic / Bedrock Converse (incl. additionalModelRequestFields placement) / Bedrock Invoke / Vertex / Azure, and the *MessagesConfig transforms for the /v1/messages routes. The expected table is hand-authored and independent of model_prices/transformation code, so it catches data-layer bugs too. A self-contained pytest_terminal_summary hook reprints failures as the sweep's at-a-glance route × effort grid.

The model→tier collapse (adaptive_full = opus-4-7; adaptive_max_only = opus-4-6 / sonnet-4-6; budget = 4.5 family + haiku-4-5) is documented in spec.py.

Testing

tests/llm_translation/reasoning_effort_matrix/ — 353 passed, 84 skipped offline (the 84 are the CI-gated live cells, correctly dormant off-CI). Failure-grid rendering verified by a forced-failure smoke run. black + ruff clean.

Type

🧪 Tests

Note

Low Risk
Low risk because this PR only adds new tests and pytest hooks; the main risk is added CI runtime/flakiness from the CI-gated live provider-acceptance test.

Overview
Adds a new reasoning_effort wire-translation matrix test suite that parametrizes across multiple Anthropic routes/providers, Claude model tiers, and effort values to assert the exact thinking/output_config subtree sent on the wire or a clean client-side 400 for invalid efforts.

Introduces a staleness canary that fails if model_prices_and_context_window.json gains reasoning-capable Claude families not mapped in the matrix, plus a CI/VCR-gated live Anthropic round-trip acceptance check and a pytest_terminal_summary hook that reprints failures as an at-a-glance matrix grid.

^{Reviewed by Cursor Bugbot for commit 5ead905. Bugbot is set up for automated code reviews on this repo. Configure here.}

Automated successor to the manual 231-cell QA sweep behind #27039 / #27074. Parametrized matrix over 5 routes x models x 11 efforts x 2 entrypoints (/chat/completions + /v1/messages). Two always-on offline oracles: - wire-request: assert the reasoning subtree (thinking + output_config) LiteLLM builds matches a hand-authored (tier x effort) spec, value AND presence/absence; client-rejected efforts must raise a clean 400-class error (not a bare ValueError/500). - staleness canary: fail loud when model_prices grows a reasoning-capable Claude family the matrix does not cover. Plus a CI-gated, VCR-recorded provider-acceptance test (Anthropic route) that confirms Anthropic accepts the request LiteLLM builds; dormant off-CI. Self-contained failure-grid renderer reproduces the sweep's at-a-glance view.

…itellm_/optimistic-diffie-f69568

codecov · 2026-05-16T07:49:46Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

cursor

Cursor Bugbot has reviewed your changes using high mode and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 5ead905. Configure here.}

cursor · 2026-05-16T07:51:53Z

+        "4-sonnet",
+        "4-opus",
+    }
+)


Canary allowlist tokens silently suppress future model detection

Medium Severity

The CANARY_ALLOWLIST entries "opus-4" and "sonnet-4" are overly broad substrings. In unmapped_reasoning_claude_families, the any(token in name for token in known) check will silently match new versioned models like claude-opus-4-X or claude-sonnet-4-X. This defeats the canary's purpose of flagging new, unmapped reasoning-capable Claude families, which was designed to prevent recurrence of past issues.

^{Reviewed by Cursor Bugbot for commit 5ead905. Configure here.}

greptile-apps · 2026-05-16T07:52:18Z

Greptile Summary

This PR adds a parametrized, fully-offline regression matrix under tests/llm_translation/reasoning_effort_matrix/ covering reasoning_effort wire-translation across 5 provider routes, 6 Claude models, and 11 effort values to prevent recurrence of the bugs fixed in #27039 and #27074.

Wire-request oracle (test_wire_request): asserts the thinking + output_config subtree built by each per-provider transform matches a hand-authored spec by value and presence/absence, with clean 400-class errors for invalid efforts.
Staleness canary (test_no_unmapped_reasoning_models): scans model_prices_and_context_window.json for reasoning-capable Claude families absent from the matrix.
Live provider-acceptance oracle (test_provider_accepts_built_request): CI-gated VCR-recorded Anthropic round-trip, dormant locally.

Confidence Score: 3/5

Three concrete issues undermine the reliability of the regression net: the Bedrock Converse adapter reads thinking from a pre-transform dict (missing wire-placement regressions), the canary allowlist silently whitelists future opus-4-x/sonnet-4-x variants, and the live test violates the folder no-network-calls policy.

The Bedrock Converse adapter gives false positives for thinking-placement bugs; the staleness canary would silently miss new minor-version Claude models; the live test violates the tests/llm_translation/ no-network-calls policy. All three issues directly undermine the regression coverage this PR aims to establish.

spec.py has both the Bedrock Converse adapter bug and the canary allowlist overmatch; test_reasoning_effort_wire_matrix.py contains the live-call policy violation.

Important Files Changed

Filename	Overview
tests/llm_translation/reasoning_effort_matrix/spec.py	Hand-authored expected table and per-route adapters. Two issues: Bedrock Converse adapter reads `thinking` from pre-transform `op` rather than from post-transform `additionalModelRequestFields`, creating a false-positive coverage gap; and the staleness canary's substring allowlist silently matches future minor-version models, defeating its stated purpose.
tests/llm_translation/reasoning_effort_matrix/test_reasoning_effort_wire_matrix.py	Parametrized wire-matrix test with offline and CI-gated live oracles. The live test makes real Anthropic API calls, violating the folder rule that prohibits real network calls.
tests/llm_translation/reasoning_effort_matrix/conftest.py	Self-contained pytest hook for grid-formatted failure reporting and CI gate logic. Clean and correct.
tests/llm_translation/reasoning_effort_matrix/init.py	Empty package marker, no issues.

Comments Outside Diff (1)

tests/llm_translation/reasoning_effort_matrix/test_reasoning_effort_wire_matrix.py, line 822-861 (link)

Real network calls in tests/llm_translation/ folder

test_provider_accepts_built_request calls litellm.completion() and litellm.anthropic_messages() — real HTTP calls to the Anthropic API. The custom rule for tests/llm_translation/ explicitly prohibits real network calls and requires only mock/offline tests. Even though this test is gated by ci_record_enabled() and skipped locally, it fires live API calls during CI cassette recording, directly violating the "only mock tests" policy for this folder.

Rule Used: What: prevent any tests from being added here that... (source)

_{Reviews (1): Last reviewed commit: "Merge remote-tracking branch 'origin/lit..." | Re-trigger Greptile}

greptile-apps · 2026-05-16T07:52:25Z

+    additional = result.get("additionalModelRequestFields") or {}
+    # thinking value comes from the post-map params (what the PR test asserts);
+    # output_config placement is the #27074-item-1 wire regression.
+    return _normalize(op.get("thinking"), additional.get("output_config"))


Bedrock Converse adapter reads thinking from pre-transform op, not the wire payload

_capture_bedrock_converse_chat reads output_config from result["additionalModelRequestFields"] (the actual wire format) but reads thinking from op — the optional-params dict produced by map_openai_params, before _transform_request runs. Tracing through the production code: thinking is not in AmazonConverseConfig.__annotations__, so _prepare_request_params routes it into additional_request_params, which becomes additionalModelRequestFields. Reading from op means a regression where _transform_request drops or misplaces thinking from additionalModelRequestFields would pass this test undetected — the exact class of wire-placement bugs the matrix exists to catch.

Suggested change

return _normalize(op.get("thinking"), additional.get("output_config"))

return _normalize(additional.get("thinking"), additional.get("output_config"))

greptile-apps · 2026-05-16T07:52:26Z

+# --------------------------------------------------------------------------- #
+
+CHAT = "chat_completions"
+MESSAGES = "v1_messages"
+
+MESSAGES_ENTRYPOINT = "messages"
+
+
+@dataclass(frozen=True)
+class Route:
+    name: str
+    # entrypoint -> ordered list of (wire_model_id, canonical_tier_key)


Staleness canary — substring tokens "opus-4" and "sonnet-4" are too broad

unmapped_reasoning_claude_families uses any(token in name for token in known) as a substring check. known includes both the MODEL_TIER keys (e.g. "opus-4-5", "opus-4-6", "opus-4-7") and the CANARY_ALLOWLIST entries (e.g. "opus-4", "sonnet-4"). The short allowlist tokens are substrings of future model names: a new model "claude-opus-4-8-..." would match "opus-4" from the allowlist and be silently skipped — never triggering the canary — even though it isn’t in MODEL_TIER and its tier is unknown. The canary would fail to alert on exactly the per-model recurrence it was designed to prevent.

mateo-berri

Please fix the greptile P1's

mateo-berri

Please fix the greptile P1's

yuneng-berri added 2 commits May 16, 2026 00:44

Merge remote-tracking branch 'origin/litellm_internal_staging' into l…

5ead905

…itellm_/optimistic-diffie-f69568

yuneng-berri requested a review from mateo-berri May 16, 2026 07:45

cursor Bot reviewed May 16, 2026

View reviewed changes

greptile-apps Bot reviewed May 16, 2026

View reviewed changes

mateo-berri requested changes May 16, 2026

View reviewed changes

yuneng-berri closed this May 16, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test(reasoning_effort): add cross-provider wire-translation matrix#28046

test(reasoning_effort): add cross-provider wire-translation matrix#28046
yuneng-berri wants to merge 2 commits into
litellm_internal_stagingfrom
litellm_/optimistic-diffie-f69568

yuneng-berri commented May 16, 2026 •

edited by cursor Bot

Loading

Uh oh!

codecov Bot commented May 16, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot May 16, 2026

Uh oh!

greptile-apps Bot commented May 16, 2026 •

edited

Loading

Important Files Changed

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot May 16, 2026

Uh oh!

greptile-apps Bot May 16, 2026

Uh oh!

mateo-berri left a comment

Uh oh!

mateo-berri left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	return _normalize(op.get("thinking"), additional.get("output_config"))
	return _normalize(additional.get("thinking"), additional.get("output_config"))

Uh oh!

Conversation

yuneng-berri commented May 16, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Relevant issues

What this PR does

Testing

Type

Uh oh!

codecov Bot commented May 16, 2026

Codecov Report

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 16, 2026

Choose a reason for hiding this comment

Canary allowlist tokens silently suppress future model detection

Uh oh!

greptile-apps Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

mateo-berri left a comment

Choose a reason for hiding this comment

Uh oh!

mateo-berri left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yuneng-berri commented May 16, 2026 •

edited by cursor Bot

Loading

greptile-apps Bot commented May 16, 2026 •

edited

Loading