Skip to content

test(reasoning_effort): add cross-provider wire-translation matrix#28046

Closed
yuneng-berri wants to merge 2 commits into
litellm_internal_stagingfrom
litellm_/optimistic-diffie-f69568
Closed

test(reasoning_effort): add cross-provider wire-translation matrix#28046
yuneng-berri wants to merge 2 commits into
litellm_internal_stagingfrom
litellm_/optimistic-diffie-f69568

Conversation

@yuneng-berri

@yuneng-berri yuneng-berri commented May 16, 2026

Copy link
Copy Markdown
Collaborator

Relevant issues

Follow-up regression coverage for the reasoning_effort mapping bug class behind #27039 (reasoning_effort="none" NoneType crash) and #27074 (output_config.effort strips + ValueError/500 on garbage effort). Automated successor to the manual 231-cell QA sweep on #27039.

What this PR does

Adds a parametrized, fully-offline matrix under tests/llm_translation/reasoning_effort_matrix/ covering 5 routes × Claude models × 11 effort values × 2 entrypoints (/chat/completions and /v1/messages).

Two always-on offline oracles (the per-PR regression net, no network/creds):

One CI-gated oracle (dormant off-CI, no new CI config — runs in the existing llm_translation job via Redis-VCR):

  • Provider acceptance — replays/records the real Anthropic round-trip and asserts Anthropic accepts the request LiteLLM built. Skipped locally; activates on CASSETTE_REDIS_URL.

Coverage is per-provider idiomatic: map_openai_params + transform_request for Anthropic / Bedrock Converse (incl. additionalModelRequestFields placement) / Bedrock Invoke / Vertex / Azure, and the *MessagesConfig transforms for the /v1/messages routes. The expected table is hand-authored and independent of model_prices/transformation code, so it catches data-layer bugs too. A self-contained pytest_terminal_summary hook reprints failures as the sweep's at-a-glance route × effort grid.

The model→tier collapse (adaptive_full = opus-4-7; adaptive_max_only = opus-4-6 / sonnet-4-6; budget = 4.5 family + haiku-4-5) is documented in spec.py.

Testing

tests/llm_translation/reasoning_effort_matrix/353 passed, 84 skipped offline (the 84 are the CI-gated live cells, correctly dormant off-CI). Failure-grid rendering verified by a forced-failure smoke run. black + ruff clean.

Type

🧪 Tests


Note

Low Risk
Low risk because this PR only adds new tests and pytest hooks; the main risk is added CI runtime/flakiness from the CI-gated live provider-acceptance test.

Overview
Adds a new reasoning_effort wire-translation matrix test suite that parametrizes across multiple Anthropic routes/providers, Claude model tiers, and effort values to assert the exact thinking/output_config subtree sent on the wire or a clean client-side 400 for invalid efforts.

Introduces a staleness canary that fails if model_prices_and_context_window.json gains reasoning-capable Claude families not mapped in the matrix, plus a CI/VCR-gated live Anthropic round-trip acceptance check and a pytest_terminal_summary hook that reprints failures as an at-a-glance matrix grid.

Reviewed by Cursor Bugbot for commit 5ead905. Bugbot is set up for automated code reviews on this repo. Configure here.

Automated successor to the manual 231-cell QA sweep behind #27039 / #27074.

Parametrized matrix over 5 routes x models x 11 efforts x 2 entrypoints
(/chat/completions + /v1/messages). Two always-on offline oracles:

- wire-request: assert the reasoning subtree (thinking + output_config)
  LiteLLM builds matches a hand-authored (tier x effort) spec, value AND
  presence/absence; client-rejected efforts must raise a clean 400-class
  error (not a bare ValueError/500).
- staleness canary: fail loud when model_prices grows a reasoning-capable
  Claude family the matrix does not cover.

Plus a CI-gated, VCR-recorded provider-acceptance test (Anthropic route)
that confirms Anthropic accepts the request LiteLLM builds; dormant
off-CI. Self-contained failure-grid renderer reproduces the sweep's
at-a-glance view.
@yuneng-berri yuneng-berri requested a review from mateo-berri May 16, 2026 07:45
@codecov

codecov Bot commented May 16, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using high mode and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 5ead905. Configure here.

"4-sonnet",
"4-opus",
}
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Canary allowlist tokens silently suppress future model detection

Medium Severity

The CANARY_ALLOWLIST entries "opus-4" and "sonnet-4" are overly broad substrings. In unmapped_reasoning_claude_families, the any(token in name for token in known) check will silently match new versioned models like claude-opus-4-X or claude-sonnet-4-X. This defeats the canary's purpose of flagging new, unmapped reasoning-capable Claude families, which was designed to prevent recurrence of past issues.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 5ead905. Configure here.

@greptile-apps

greptile-apps Bot commented May 16, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a parametrized, fully-offline regression matrix under tests/llm_translation/reasoning_effort_matrix/ covering reasoning_effort wire-translation across 5 provider routes, 6 Claude models, and 11 effort values to prevent recurrence of the bugs fixed in #27039 and #27074.

  • Wire-request oracle (test_wire_request): asserts the thinking + output_config subtree built by each per-provider transform matches a hand-authored spec by value and presence/absence, with clean 400-class errors for invalid efforts.
  • Staleness canary (test_no_unmapped_reasoning_models): scans model_prices_and_context_window.json for reasoning-capable Claude families absent from the matrix.
  • Live provider-acceptance oracle (test_provider_accepts_built_request): CI-gated VCR-recorded Anthropic round-trip, dormant locally.

Confidence Score: 3/5

Three concrete issues undermine the reliability of the regression net: the Bedrock Converse adapter reads thinking from a pre-transform dict (missing wire-placement regressions), the canary allowlist silently whitelists future opus-4-x/sonnet-4-x variants, and the live test violates the folder no-network-calls policy.

The Bedrock Converse adapter gives false positives for thinking-placement bugs; the staleness canary would silently miss new minor-version Claude models; the live test violates the tests/llm_translation/ no-network-calls policy. All three issues directly undermine the regression coverage this PR aims to establish.

spec.py has both the Bedrock Converse adapter bug and the canary allowlist overmatch; test_reasoning_effort_wire_matrix.py contains the live-call policy violation.

Important Files Changed

Filename Overview
tests/llm_translation/reasoning_effort_matrix/spec.py Hand-authored expected table and per-route adapters. Two issues: Bedrock Converse adapter reads thinking from pre-transform op rather than from post-transform additionalModelRequestFields, creating a false-positive coverage gap; and the staleness canary's substring allowlist silently matches future minor-version models, defeating its stated purpose.
tests/llm_translation/reasoning_effort_matrix/test_reasoning_effort_wire_matrix.py Parametrized wire-matrix test with offline and CI-gated live oracles. The live test makes real Anthropic API calls, violating the folder rule that prohibits real network calls.
tests/llm_translation/reasoning_effort_matrix/conftest.py Self-contained pytest hook for grid-formatted failure reporting and CI gate logic. Clean and correct.
tests/llm_translation/reasoning_effort_matrix/init.py Empty package marker, no issues.

Comments Outside Diff (1)

  1. tests/llm_translation/reasoning_effort_matrix/test_reasoning_effort_wire_matrix.py, line 822-861 (link)

    P1 Real network calls in tests/llm_translation/ folder

    test_provider_accepts_built_request calls litellm.completion() and litellm.anthropic_messages() — real HTTP calls to the Anthropic API. The custom rule for tests/llm_translation/ explicitly prohibits real network calls and requires only mock/offline tests. Even though this test is gated by ci_record_enabled() and skipped locally, it fires live API calls during CI cassette recording, directly violating the "only mock tests" policy for this folder.

    Rule Used: What: prevent any tests from being added here that... (source)

Reviews (1): Last reviewed commit: "Merge remote-tracking branch 'origin/lit..." | Re-trigger Greptile

additional = result.get("additionalModelRequestFields") or {}
# thinking value comes from the post-map params (what the PR test asserts);
# output_config placement is the #27074-item-1 wire regression.
return _normalize(op.get("thinking"), additional.get("output_config"))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Bedrock Converse adapter reads thinking from pre-transform op, not the wire payload

_capture_bedrock_converse_chat reads output_config from result["additionalModelRequestFields"] (the actual wire format) but reads thinking from op — the optional-params dict produced by map_openai_params, before _transform_request runs. Tracing through the production code: thinking is not in AmazonConverseConfig.__annotations__, so _prepare_request_params routes it into additional_request_params, which becomes additionalModelRequestFields. Reading from op means a regression where _transform_request drops or misplaces thinking from additionalModelRequestFields would pass this test undetected — the exact class of wire-placement bugs the matrix exists to catch.

Suggested change
return _normalize(op.get("thinking"), additional.get("output_config"))
return _normalize(additional.get("thinking"), additional.get("output_config"))

Comment on lines +214 to +225
# --------------------------------------------------------------------------- #

CHAT = "chat_completions"
MESSAGES = "v1_messages"

MESSAGES_ENTRYPOINT = "messages"


@dataclass(frozen=True)
class Route:
name: str
# entrypoint -> ordered list of (wire_model_id, canonical_tier_key)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Staleness canary — substring tokens "opus-4" and "sonnet-4" are too broad

unmapped_reasoning_claude_families uses any(token in name for token in known) as a substring check. known includes both the MODEL_TIER keys (e.g. "opus-4-5", "opus-4-6", "opus-4-7") and the CANARY_ALLOWLIST entries (e.g. "opus-4", "sonnet-4"). The short allowlist tokens are substrings of future model names: a new model "claude-opus-4-8-..." would match "opus-4" from the allowlist and be silently skipped — never triggering the canary — even though it isn’t in MODEL_TIER and its tier is unknown. The canary would fail to alert on exactly the per-model recurrence it was designed to prevent.

@mateo-berri mateo-berri left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the greptile P1's

@mateo-berri mateo-berri left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the greptile P1's

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants