fix(health): correct bedrock embedding health checks#30583
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
e9292d0 to
6696899
Compare
Greptile SummaryThis PR fixes two independent Bedrock embedding health-check failures:
Confidence Score: 5/5Safe to merge — all three changed files are narrowly scoped to health-check paths, with no impact on the live request path. The two bug fixes are well-contained: mode resolution falls back gracefully (returns None → treated as chat) for unknown models, the provider pin is correctly guarded with if not litellm_params.get('custom_llm_provider'), and every changed branch has a corresponding unit test. The previous reviewer concern about unconditional provider override is addressed. The ahealth_check signature widening is backward-compatible. No files require special attention.
|
| Filename | Overview |
|---|---|
| litellm/proxy/health_check.py | Adds _resolve_health_check_mode() to auto-detect embedding/speech modes from the model cost map; threads the resolved mode through max_tokens, reasoning_effort, and audio_speech guards; pins custom_llm_provider to "bedrock" only when unset after prefix stripping — correctly preserving an explicit bedrock_converse. |
| litellm/main.py | Widens the ahealth_check mode parameter from Literal[...] to str |
| tests/test_litellm/proxy/test_health_check_max_tokens.py | Adds unit tests covering both bug fixes: no max_tokens/reasoning_effort injected into Bedrock embedding probes, provider pinned after prefix strip, explicit provider preserved, and mode correctly threaded to ahealth_check. |
Reviews (3): Last reviewed commit: "fix(health): resolve probe mode once for..." | Re-trigger Greptile
|
Addressed the The line 468/476 notes (P2; |
Health checks for Bedrock embedding deployments failed in two ways. A deployment configured without an explicit model_info.mode was probed as chat, so max_tokens was injected and Bedrock embeddings rejected it with 400 "extraneous key [max_tokens]". Separately, stripping the bedrock/ routing prefix dropped the provider, so a cross-region inference-profile id like us.cohere.embed-v4:0 failed downstream with "LLM Provider NOT provided". Resolve the deployment mode from the model cost map (which understands the bedrock/ and us./eu./apac. prefixes) before deciding whether to inject max_tokens, and pin custom_llm_provider to bedrock when stripping the prefix so the bare model id still resolves. ahealth_check now accepts any string mode so the resolved embedding mode routes the probe to the embedding handler.
The bedrock prefix-strip pinned custom_llm_provider to bedrock unconditionally, so a deployment that set custom_llm_provider: bedrock_converse had it overwritten at health-check time and the probe hit the Invoke endpoint instead of Converse, a different request format that can report a spurious failure. Only fill in bedrock when the deployment left the provider blank, which still resolves bare cross-region ids like us.cohere.embed-v4:0 while leaving an explicit provider untouched.
5d1be34 to
63174ac
Compare
The existing tests check _resolve_health_check_mode and the params builder
in isolation, but nothing verified that _run_model_health_check actually
threads the resolved mode into litellm.ahealth_check. Without that, a
refactor that probed with model_info.get("mode") again would reintroduce
the chat fallback for embedding deployments while every test stayed green.
This drives _run_model_health_check with a bedrock embedding deployment and
asserts the probe is called with mode=embedding and the embedding params.
…peech The reasoning_effort and audio_speech branches read model_info.mode directly, so an embedding deployment declared without an explicit mode (the case this PR targets) was still treated as chat-like: a configured health_check_reasoning_effort got injected into the embedding probe, which embeddings reject as an unknown field, and an auto-detected audio_speech deployment never had its voice set. Resolve the effective mode once from the cost map and reuse it for the max_tokens, reasoning_effort, and audio_speech decisions so they all agree with the mode threaded into ahealth_check.
|
bugbot run |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 66692e2. Configure here.
| "ocr", | ||
| ] | ||
| ] = "chat", | ||
| mode: str | None = "chat", |
There was a problem hiding this comment.
why would we remove the literal from this?
There was a problem hiding this comment.
ahealth_check only uses mode as a lookup key into its handler dict and already validates it at runtime (raises "Mode X not supported" for misses), and it even reassigns mode from the cost map internally; so the Literal was stricter than the function's own behavior. On top of that, the resolved mode also drives the max_tokens/reasoning_effort/voice decisions, which need the raw string to recognize non-chat modes; if we narrowed it to the literal union, a mode like moderation would fail validation, collapse to None, get treated as chat, and the max_tokens 400 bug would come back. Widening to str | None just makes the signature match what the function actually does; keeping the literal would require a cast that asserts a type the runtime values don't really satisfy


Relevant issues
Customer thread (ticket #4474) reporting two Bedrock embedding setup errors that have no PR yet. The Bedrock Mantle config/auth part of the same thread already shipped via #29490, #30083, #30163, #30426 and #29788; this PR covers the two embedding bugs that were left unaddressed.
Linear ticket
LIT-3747
Pre-Submission checklist
make test-unit@greptileaiand received a Confidence Score of at least 4/5 before requesting a maintainer reviewWhat was broken
A Bedrock embedding deployment configured the natural way, without an explicit
model_info.mode, failed its health check for two independent reasons.First, the health-check builder treated a missing mode as
chatand injectedmax_tokensinto the probe. Bedrock embeddings reject unknown fields, sobedrock/amazon.titan-embed-text-v2:0came back400 "Malformed input request: extraneous key [max_tokens]", anddrop_paramsdid not help because the param is injected by the health-check builder rather than mapped through provider drop logic. The only workaround was settingmodel_info: {mode: embedding}by hand.Second, the Bedrock rewrite strips the
bedrock/routing prefix but did not record the provider, so a cross-region inference-profile id likebedrock/us.cohere.embed-v4:0became the bareus.cohere.embed-v4:0andahealth_check'sget_llm_providerraisedlitellm.BadRequestError: LLM Provider NOT provided(theus.prefix keeps it out ofbedrock_embedding_models).Changes
litellm/proxy/health_check.pynow resolves a deployment's effective mode from the model cost map before deciding whether to injectmax_tokens.litellm.get_model_infoalready understands thebedrock/andus./eu./apac.prefixes, so titan and cohere both resolve toembeddingand the probe no longer carriesmax_tokens. That same resolved mode now also gates thereasoning_effortandaudio_speechvoiceinjections, so an embedding deployment auto-detected without an explicitmodel_info.modeno longer picks up a configuredreasoning_effortthat the embeddings endpoint would reject. When the rewrite stripsbedrock/, it pinscustom_llm_providertobedrockonly when the deployment hasn't already set one, so a bare cross-region id still resolves to the provider while an explicitcustom_llm_provider: bedrock_conversesurvives untouched.litellm.ahealth_check'smodeparameter is widened from a strictLiteraltoOptional[str](it only uses the value as a handler key and already tolerates arbitrary modes), so the resolved embedding mode flows through and routes the probe to the embedding handler.Screenshots / Proof of Fix
I could not exercise this against live AWS Bedrock from the sandbox (no Bedrock credentials), so here is the runbook to capture the proof against a real account.
litellm/proxy/dev_config.yaml:unhealthy_endpointswith the two errors above):Type
🐛 Bug Fix
Note
Low Risk
Scoped to proxy health-check param building and a typing widen on
ahealth_check; behavior change is intentional for Bedrock embeddings with backward-compatible fallback when mode is unknown.Overview
Fixes proxy health checks for Bedrock embedding deployments that omit
model_info.mode.Mode resolution: Adds
_resolve_health_check_mode, which usesmodel_info.modewhen set, otherwiselitellm.get_model_info(handlesbedrock/and cross-regionus./eu./apac.ids). That resolved mode drivesmax_tokensinjection,reasoning_effort, audio/voice handling, and themodepassed intoahealth_check—so embeddings are no longer probed as chat withmax_tokensor strayreasoning_effort.Bedrock routing: After stripping
bedrock/(and region segments) from the model id, the builder setscustom_llm_providertobedrockonly when unset, so bare ids likeus.cohere.embed-v4:0still resolve while explicitbedrock_converseis preserved.API typing:
ahealth_check'smodeis relaxed from a strictLiteraltoOptional[str]so resolved modes (e.g.embedding) flow through.Tests cover Titan/Cohere embedding paths, explicit mode override, chat regression, provider pin vs preserve, and threading mode into
ahealth_check.Reviewed by Cursor Bugbot for commit 66692e2. Bugbot is set up for automated code reviews on this repo. Configure here.