test(reasoning_effort): add cross-provider wire-translation matrix#28046
yuneng-berri wants to merge 2 commits into
Automated successor to the manual 231-cell QA sweep behind #27039 / #27074. Parametrized matrix over 5 routes x models x 11 efforts x 2 entrypoints (/chat/completions + /v1/messages).

Two always-on offline oracles:
- wire-request: assert the reasoning subtree (thinking + output_config) LiteLLM builds matches a hand-authored (tier x effort) spec, value AND presence/absence; client-rejected efforts must raise a clean 400-class error (not a bare ValueError/500).
- staleness canary: fail loud when model_prices grows a reasoning-capable Claude family the matrix does not cover.

Plus a CI-gated, VCR-recorded provider-acceptance test (Anthropic route) that confirms Anthropic accepts the request LiteLLM builds; dormant off-CI. Self-contained failure-grid renderer reproduces the sweep's at-a-glance view.
…itellm_/optimistic-diffie-f69568
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Cursor Bugbot has reviewed your changes using high mode and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 5ead905. Configure here.
        "4-sonnet",
        "4-opus",
    }
    )
Canary allowlist tokens silently suppress future model detection
Medium Severity
The CANARY_ALLOWLIST entries "opus-4" and "sonnet-4" are overly broad substrings. In unmapped_reasoning_claude_families, the any(token in name for token in known) check will silently match new versioned models like claude-opus-4-X or claude-sonnet-4-X. This defeats the canary's purpose of flagging new, unmapped reasoning-capable Claude families, which was designed to prevent recurrence of past issues.
Greptile Summary
This PR adds a parametrized, fully-offline regression matrix under
Confidence Score: 3/5
Three concrete issues undermine the reliability of the regression net: the Bedrock Converse adapter gives false positives for thinking-placement bugs; the staleness canary would silently miss new minor-version Claude models; and the live test violates the tests/llm_translation/ no-network-calls policy. All three directly undermine the regression coverage this PR aims to establish.

spec.py has both the Bedrock Converse adapter bug and the canary allowlist overmatch; test_reasoning_effort_wire_matrix.py contains the live-call policy violation.
| Filename | Overview |
|---|---|
| tests/llm_translation/reasoning_effort_matrix/spec.py | Hand-authored expected table and per-route adapters. Two issues: Bedrock Converse adapter reads thinking from pre-transform op rather than from post-transform additionalModelRequestFields, creating a false-positive coverage gap; and the staleness canary's substring allowlist silently matches future minor-version models, defeating its stated purpose. |
| tests/llm_translation/reasoning_effort_matrix/test_reasoning_effort_wire_matrix.py | Parametrized wire-matrix test with offline and CI-gated live oracles. The live test makes real Anthropic API calls, violating the folder rule that prohibits real network calls. |
| tests/llm_translation/reasoning_effort_matrix/conftest.py | Self-contained pytest hook for grid-formatted failure reporting and CI gate logic. Clean and correct. |
| tests/llm_translation/reasoning_effort_matrix/__init__.py | Empty package marker, no issues. |
Comments Outside Diff (1)
- tests/llm_translation/reasoning_effort_matrix/test_reasoning_effort_wire_matrix.py, line 822-861: Real network calls in `tests/llm_translation/` folder

  `test_provider_accepts_built_request` calls `litellm.completion()` and `litellm.anthropic_messages()`, which make real HTTP calls to the Anthropic API. The custom rule for `tests/llm_translation/` explicitly prohibits real network calls and requires only mock/offline tests. Even though this test is gated by `ci_record_enabled()` and skipped locally, it fires live API calls during CI cassette recording, directly violating the "only mock tests" policy for this folder.

  Rule Used: What: prevent any tests from being added here that...

Reviews (1): Last reviewed commit: "Merge remote-tracking branch 'origin/lit..."
    additional = result.get("additionalModelRequestFields") or {}
    # thinking value comes from the post-map params (what the PR test asserts);
    # output_config placement is the #27074-item-1 wire regression.
    return _normalize(op.get("thinking"), additional.get("output_config"))
Bedrock Converse adapter reads `thinking` from pre-transform `op`, not the wire payload
_capture_bedrock_converse_chat reads output_config from result["additionalModelRequestFields"] (the actual wire format) but reads thinking from op — the optional-params dict produced by map_openai_params, before _transform_request runs. Tracing through the production code: thinking is not in AmazonConverseConfig.__annotations__, so _prepare_request_params routes it into additional_request_params, which becomes additionalModelRequestFields. Reading from op means a regression where _transform_request drops or misplaces thinking from additionalModelRequestFields would pass this test undetected — the exact class of wire-placement bugs the matrix exists to catch.
Suggested change:

    - return _normalize(op.get("thinking"), additional.get("output_config"))
    + return _normalize(additional.get("thinking"), additional.get("output_config"))
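To illustrate why the read location matters, here is a minimal, self-contained sketch. The functions `fake_map_openai_params`, `buggy_transform_request`, `capture_from_op`, and `capture_from_wire` are hypothetical stand-ins and not LiteLLM code, and the `budget_tokens` value is arbitrary. Only the key names `thinking`, `output_config`, and `additionalModelRequestFields` come from the review comment.

```python
# Hypothetical stand-ins for illustration only; these are not LiteLLM functions.

def fake_map_openai_params(effort: str) -> dict:
    """Pretend optional-params mapping: always produces a thinking block."""
    return {
        "thinking": {"type": "enabled", "budget_tokens": 1024},  # arbitrary value
        "output_config": {"effort": effort},
    }


def buggy_transform_request(op: dict) -> dict:
    """Simulated regression: output_config reaches the wire, thinking is dropped."""
    return {"additionalModelRequestFields": {"output_config": op["output_config"]}}


def capture_from_op(op: dict, result: dict) -> tuple:
    """Reads thinking from the pre-transform dict (the pattern flagged above)."""
    additional = result.get("additionalModelRequestFields") or {}
    return op.get("thinking"), additional.get("output_config")


def capture_from_wire(op: dict, result: dict) -> tuple:
    """Reads both fields from the payload that would actually be sent."""
    additional = result.get("additionalModelRequestFields") or {}
    return additional.get("thinking"), additional.get("output_config")


op = fake_map_openai_params("high")
result = buggy_transform_request(op)
expected = ({"type": "enabled", "budget_tokens": 1024}, {"effort": "high"})

# The pre-transform read still matches the spec, so the regression goes unnoticed.
print("op-based capture matches spec:  ", capture_from_op(op, result) == expected)
# The wire read sees the missing thinking block, so the regression is detected.
print("wire-based capture matches spec:", capture_from_wire(op, result) == expected)
```

Under these assumptions the op-based capture still equals the expected subtree even though `thinking` never reaches the simulated wire payload, while the wire-based capture does not.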
    # --------------------------------------------------------------------------- #

    CHAT = "chat_completions"
    MESSAGES = "v1_messages"

    MESSAGES_ENTRYPOINT = "messages"


    @dataclass(frozen=True)
    class Route:
        name: str
        # entrypoint -> ordered list of (wire_model_id, canonical_tier_key)
Staleness canary — substring tokens "opus-4" and "sonnet-4" are too broad
unmapped_reasoning_claude_families uses any(token in name for token in known) as a substring check. known includes both the MODEL_TIER keys (e.g. "opus-4-5", "opus-4-6", "opus-4-7") and the CANARY_ALLOWLIST entries (e.g. "opus-4", "sonnet-4"). The short allowlist tokens are substrings of future model names: a new model "claude-opus-4-8-..." would match "opus-4" from the allowlist and be silently skipped — never triggering the canary — even though it isn’t in MODEL_TIER and its tier is unknown. The canary would fail to alert on exactly the per-model recurrence it was designed to prevent.
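The overmatch can be reproduced in a few lines. This is only a sketch: the token sets below are assumptions built from the names quoted in this comment, not the actual contents of `spec.py`, and `is_known_bounded` is a hypothetical stricter check offered as one possible direction, not the fix the authors chose. The model names passed in are examples.

```python
import re

# Assumed token sets, taken from the names quoted in the review comment.
MODEL_TIER_KEYS = {"opus-4-5", "opus-4-6", "opus-4-7"}
CANARY_ALLOWLIST = {"opus-4", "sonnet-4"}


def is_known_substring(name: str, known: set) -> bool:
    """The plain substring check described in the comment."""
    return any(token in name for token in known)


def is_known_bounded(name: str, known: set) -> bool:
    """Hypothetical stricter check.

    A token only counts if it is not immediately extended by another digit
    or by a short '-N' / '-NN' minor-version suffix. Longer numeric suffixes
    (for example an 8-digit date stamp) are still accepted.
    """
    for token in known:
        pattern = re.escape(token) + r"(?!\d)(?!-\d{1,2}(?:\D|$))"
        if re.search(pattern, name):
            return True
    return False


known = MODEL_TIER_KEYS | CANARY_ALLOWLIST

# A future minor version is swallowed by the broad "opus-4" token.
print(is_known_substring("claude-opus-4-8", known))
# The bounded check leaves it unmatched, so a canary could flag it.
print(is_known_bounded("claude-opus-4-8", known))
```

The design choice in the sketch is to treat a one- or two-digit suffix as a version bump while letting date-style suffixes through, so an entry such as `opus-4` keeps covering dated IDs of the same family without also covering `opus-4-8`.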
mateo-berri left a comment
Please fix the Greptile P1s.


Relevant issues
Follow-up regression coverage for the `reasoning_effort` mapping bug class behind #27039 (`reasoning_effort="none"` `NoneType` crash) and #27074 (`output_config.effort` strips + `ValueError`/500 on garbage effort). Automated successor to the manual 231-cell QA sweep on #27039.

What this PR does
Adds a parametrized, fully-offline matrix under `tests/llm_translation/reasoning_effort_matrix/` covering 5 routes × Claude models × 11 effort values × 2 entrypoints (`/chat/completions` and `/v1/messages`).

Two always-on offline oracles (the per-PR regression net, no network/creds):
- Wire-request: asserts the reasoning subtree (`thinking` + `output_config`) LiteLLM builds matches a hand-authored `(tier × effort)` spec, by value AND presence/absence (catches silent strips). Client-rejected efforts must raise a clean 400-class error (`BadRequestError` / `AnthropicError`@400), never a bare `ValueError`/500.
- Staleness canary: fails loud when `model_prices` grows a reasoning-capable Claude family the matrix does not cover, so the per-model recurrence (fix(anthropic,bedrock): omit thinking/output_config when reasoning_effort="none" #27039 → fix(anthropic,bedrock,vertex): forward output_config.effort + 400 on garbage reasoning_effort #27074) can't slip through silently.

One CI-gated oracle (dormant off-CI, no new CI config; runs in the existing `llm_translation` job via Redis-VCR):
- Provider acceptance: a VCR-recorded test on the Anthropic route that confirms Anthropic accepts the request LiteLLM builds … `CASSETTE_REDIS_URL`.

Coverage is per-provider idiomatic: `map_openai_params` + `transform_request` for Anthropic / Bedrock Converse (incl. `additionalModelRequestFields` placement) / Bedrock Invoke / Vertex / Azure, and the `*MessagesConfig` transforms for the `/v1/messages` routes. The expected table is hand-authored and independent of `model_prices` / transformation code, so it catches data-layer bugs too. A self-contained `pytest_terminal_summary` hook reprints failures as the sweep's at-a-glance route × effort grid.

The model→tier collapse (`adaptive_full` = opus-4-7; `adaptive_max_only` = opus-4-6 / sonnet-4-6; `budget` = 4.5 family + haiku-4-5) is documented in `spec.py`.

Testing
`tests/llm_translation/reasoning_effort_matrix/`: 353 passed, 84 skipped offline (the 84 are the CI-gated live cells, correctly dormant off-CI). Failure-grid rendering verified by a forced-failure smoke run. `black` + `ruff` clean.

Type
🧪 Tests
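The "value AND presence/absence" comparison used by the wire-request oracle can be sketched as follows. Everything here is hypothetical: the spec entries, the toy `build_reasoning_subtree` builder, and the `ClientRejected` exception are invented for illustration and do not reflect the real table in `spec.py` or LiteLLM's actual payloads.

```python
# Hypothetical sketch: spec values, builder, and exception are invented here.

ABSENT = None  # marker meaning "this key must not be present on the wire"

# (tier, effort) -> expected (thinking, output_config); values are made up.
EXPECTED = {
    ("budget", "none"): (ABSENT, ABSENT),
    ("budget", "low"): ({"type": "enabled", "budget_tokens": 1024}, ABSENT),
    ("adaptive_full", "high"): ({"type": "adaptive"}, {"effort": "high"}),
}


class ClientRejected(Exception):
    """Stand-in for a clean 400-class error such as BadRequestError."""

    status_code = 400


def build_reasoning_subtree(tier: str, effort: str) -> dict:
    """Toy request builder standing in for the code under test."""
    if effort not in {"none", "low", "high"}:
        raise ClientRejected(f"unsupported reasoning_effort={effort!r}")
    if effort == "none":
        return {}  # neither key may appear
    if tier == "budget":
        budget = {"low": 1024, "high": 4096}[effort]
        return {"thinking": {"type": "enabled", "budget_tokens": budget}}
    return {"thinking": {"type": "adaptive"}, "output_config": {"effort": effort}}


def check_cell(tier: str, effort: str) -> bool:
    """Compare one matrix cell by value and by presence/absence of each key."""
    payload = build_reasoning_subtree(tier, effort)
    want_thinking, want_output = EXPECTED[(tier, effort)]
    for key, want in (("thinking", want_thinking), ("output_config", want_output)):
        if want is ABSENT:
            if key in payload:  # present although the spec says absent
                return False
        elif payload.get(key) != want:  # silently stripped or wrong value
            return False
    return True


for cell in EXPECTED:
    print(cell, "ok" if check_cell(*cell) else "MISMATCH")
```

Checking `key in payload` separately from the value comparison is what lets a cell fail when a field is emitted that should have been omitted, which a plain `payload.get(key) == want` comparison against `None` would not distinguish from a legitimately absent key.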
Note
Low Risk
Low risk because this PR only adds new tests and pytest hooks; the main risk is added CI runtime/flakiness from the CI-gated live provider-acceptance test.
Overview
Adds a new `reasoning_effort` wire-translation matrix test suite that parametrizes across multiple Anthropic routes/providers, Claude model tiers, and effort values to assert the exact `thinking`/`output_config` subtree sent on the wire or a clean client-side 400 for invalid efforts.

Introduces a staleness canary that fails if `model_prices_and_context_window.json` gains reasoning-capable Claude families not mapped in the matrix, plus a CI/VCR-gated live Anthropic round-trip acceptance check and a `pytest_terminal_summary` hook that reprints failures as an at-a-glance matrix grid.
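A route × effort failure grid of the kind described here could be rendered along these lines. This is a sketch under assumptions: `render_grid`, the route names, and the cell labels are hypothetical, and the PR's actual renderer in `conftest.py` may look quite different.

```python
def render_grid(failed: set, routes: list, efforts: list) -> str:
    """Render a text grid with one row per route and one column per effort.

    `failed` holds (route, effort) pairs for failing cells; all other cells
    are shown as "ok".
    """
    label_w = max(len(route) for route in routes)
    cell_w = max(len(text) for text in [*efforts, "FAIL"])

    header = " " * label_w + " | " + " ".join(e.ljust(cell_w) for e in efforts)
    lines = [header.rstrip()]
    for route in routes:
        cells = " ".join(
            ("FAIL" if (route, effort) in failed else "ok").ljust(cell_w)
            for effort in efforts
        )
        lines.append((route.ljust(label_w) + " | " + cells).rstrip())
    return "\n".join(lines)


# Example with made-up route names and a single failing cell.
grid = render_grid(
    failed={("bedrock_converse", "none")},
    routes=["anthropic", "bedrock_converse"],
    efforts=["none", "low", "high"],
)
print(grid)
```

In a real suite, a function like this would typically be called from a `pytest_terminal_summary(terminalreporter, exitstatus, config)` hook, with the failing cells collected from the reporter's failed test reports; how the PR maps test IDs back to (route, effort) pairs is not shown in this page.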