fix(health): correct bedrock embedding health checks by mateo-berri · Pull Request #30583 · BerriAI/litellm

mateo-berri · 2026-06-17T00:34:32Z

Relevant issues

Customer thread (ticket #4474) reporting two Bedrock embedding setup errors that have no PR yet. The Bedrock Mantle config/auth part of the same thread already shipped via #29490, #30083, #30163, #30426 and #29788; this PR covers the two embedding bugs that were left unaddressed.

Linear ticket

LIT-3747

Pre-Submission checklist

I have added meaningful tests
My PR passes all unit tests on make test-unit
My PR's scope is as isolated as possible; it only solves 1 specific problem
I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

What was broken

A Bedrock embedding deployment configured the natural way, without an explicit model_info.mode, failed its health check for two independent reasons.

First, the health-check builder treated a missing mode as chat and injected max_tokens into the probe. Bedrock embeddings reject unknown fields, so bedrock/amazon.titan-embed-text-v2:0 came back 400 "Malformed input request: extraneous key [max_tokens]", and drop_params did not help because the param is injected by the health-check builder rather than mapped through provider drop logic. The only workaround was setting model_info: {mode: embedding} by hand.

Second, the Bedrock rewrite strips the bedrock/ routing prefix but did not record the provider, so a cross-region inference-profile id like bedrock/us.cohere.embed-v4:0 became the bare us.cohere.embed-v4:0 and ahealth_check's get_llm_provider raised litellm.BadRequestError: LLM Provider NOT provided (the us. prefix keeps it out of bedrock_embedding_models).
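To make the second failure concrete, here is a minimal sketch of how provider inference can fail once the routing prefix is gone. It is not litellm's real get_llm_provider: `infer_provider`, `ProviderNotFound`, and the one-entry model set are hypothetical stand-ins that only mimic the behavior described above.

```python
from typing import Optional

# Hypothetical stand-in for a "known Bedrock embedding models" list.
# The region-prefixed id is deliberately absent, as described in the PR.
KNOWN_BEDROCK_EMBEDDING_MODELS = {"amazon.titan-embed-text-v2:0"}


class ProviderNotFound(ValueError):
    """Raised when no provider can be inferred for a model id."""


def infer_provider(model: str, custom_llm_provider: Optional[str] = None) -> str:
    # An explicit provider always wins.
    if custom_llm_provider:
        return custom_llm_provider
    # A routing prefix such as "bedrock/" names the provider directly.
    if "/" in model:
        return model.split("/", 1)[0]
    # Otherwise fall back to an exact match against known bare ids.
    if model in KNOWN_BEDROCK_EMBEDDING_MODELS:
        return "bedrock"
    raise ProviderNotFound(f"LLM Provider NOT provided for {model!r}")
```

Under these assumptions the prefixed id and the bare Titan id both resolve, while the bare cross-region id raises unless a provider is passed alongside it.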

Changes

litellm/proxy/health_check.py now resolves a deployment's effective mode from the model cost map before deciding whether to inject max_tokens. litellm.get_model_info already understands the bedrock/ and us./eu./apac. prefixes, so titan and cohere both resolve to embedding and the probe no longer carries max_tokens. That same resolved mode now also gates the reasoning_effort and audio_speech voice injections, so an embedding deployment auto-detected without an explicit model_info.mode no longer picks up a configured reasoning_effort that the embeddings endpoint would reject. When the rewrite strips bedrock/, it pins custom_llm_provider to bedrock only when the deployment hasn't already set one, so a bare cross-region id still resolves to the provider while an explicit custom_llm_provider: bedrock_converse survives untouched. litellm.ahealth_check's mode parameter is widened from a strict Literal to Optional[str] (it only uses the value as a handler key and already tolerates arbitrary modes), so the resolved embedding mode flows through and routes the probe to the embedding handler.
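The mode-resolution step can be sketched roughly as follows. This is an illustration, not the code in the PR: `TOY_COST_MAP`, `lookup_mode`, `resolve_mode`, and `build_probe_params` are hypothetical names, the toy map stands in for litellm's model cost map, and the `max_tokens` value of 1 is a placeholder.

```python
from typing import Any, Dict, Optional

# Hypothetical miniature cost map; the real one ships with litellm.
TOY_COST_MAP = {
    "amazon.titan-embed-text-v2:0": {"mode": "embedding"},
    "cohere.embed-v4:0": {"mode": "embedding"},
    "anthropic.claude-3-haiku-20240307-v1:0": {"mode": "chat"},
}
REGION_PREFIXES = ("us.", "eu.", "apac.")


def lookup_mode(model: str) -> Optional[str]:
    """Look up a mode after dropping the routing and region prefixes."""
    bare = model[len("bedrock/"):] if model.startswith("bedrock/") else model
    for prefix in REGION_PREFIXES:
        if bare.startswith(prefix):
            bare = bare[len(prefix):]
            break
    entry = TOY_COST_MAP.get(bare)
    return entry["mode"] if entry else None


def resolve_mode(model_info: Dict[str, Any], model: str) -> Optional[str]:
    """An explicit model_info.mode wins; otherwise consult the cost map."""
    return model_info.get("mode") or lookup_mode(model)


def build_probe_params(model: str, model_info: Dict[str, Any]) -> Dict[str, Any]:
    mode = resolve_mode(model_info, model)
    params: Dict[str, Any] = {"model": model}
    # Simplification: only chat-like probes (or unknown modes, which fall
    # back to chat) receive max_tokens.
    if mode in (None, "chat"):
        params["max_tokens"] = 1
    return params
```

With this shape, an embedding deployment that omits `model_info.mode` produces a probe without `max_tokens`, and an unknown model keeps the previous chat behavior.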

Screenshots / Proof of Fix

I could not exercise this against live AWS Bedrock from the sandbox (no Bedrock credentials), so here is the runbook to capture the proof against a real account.

Add to litellm/proxy/dev_config.yaml:

model_list:
  - model_name: titan-embed-text-v2
    litellm_params:
      model: bedrock/amazon.titan-embed-text-v2:0
      aws_region_name: us-east-1
  - model_name: cohere-embed
    litellm_params:
      model: bedrock/us.cohere.embed-v4:0
      aws_region_name: us-east-1

Start the proxy:

python litellm/proxy/proxy_cli.py --config litellm/proxy/dev_config.yaml --detailed_debug --reload --use_v2_migration_resolver 2>&1 | tee litellm.log

Hit the health endpoint and confirm both deployments are healthy (before this change they showed up under unhealthy_endpoints with the two errors above):

curl -s -H "Authorization: Bearer $LITELLM_MASTER_KEY" http://localhost:4000/health | jq '.healthy_endpoints, .unhealthy_endpoints'

Or in the UI, go to http://localhost:4000/ui/?page=models, click "Health Status" / run a health check on both deployments, and confirm both are green.
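If you would rather assert on the result than eyeball the jq output, a small helper along these lines could summarize the response. The `healthy_endpoints` and `unhealthy_endpoints` keys come from the curl command above; the assumption that each entry is a dict with a `model` key, and the name `summarize_health`, are mine and should be checked against a real response.

```python
from typing import Any, Dict, List


def summarize_health(payload: Dict[str, Any]) -> Dict[str, List[str]]:
    """Collect model ids per bucket from a parsed /health response."""

    def models(entries: Any) -> List[str]:
        # Assumption: each entry is a dict that carries a "model" key.
        return [
            str(entry.get("model", "<unknown>"))
            for entry in (entries or [])
            if isinstance(entry, dict)
        ]

    return {
        "healthy": models(payload.get("healthy_endpoints")),
        "unhealthy": models(payload.get("unhealthy_endpoints")),
    }


# Hand-written sample payload, not captured from a live proxy.
sample = {
    "healthy_endpoints": [
        {"model": "bedrock/amazon.titan-embed-text-v2:0"},
        {"model": "bedrock/us.cohere.embed-v4:0"},
    ],
    "unhealthy_endpoints": [],
}
summary = summarize_health(sample)
```

After the fix, the expectation is that `summary["unhealthy"]` is empty for the two deployments in the runbook.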

Type

🐛 Bug Fix

Note

Low Risk
Scoped to proxy health-check param building and a type widening on ahealth_check; the behavior change is intentional for Bedrock embeddings, with a backward-compatible fallback when the mode is unknown.

Overview
Fixes proxy health checks for Bedrock embedding deployments that omit model_info.mode.

Mode resolution: Adds _resolve_health_check_mode, which uses model_info.mode when set, otherwise litellm.get_model_info (handles bedrock/ and cross-region us./eu./apac. ids). That resolved mode drives max_tokens injection, reasoning_effort, audio/voice handling, and the mode passed into ahealth_check—so embeddings are no longer probed as chat with max_tokens or stray reasoning_effort.

Bedrock routing: After stripping bedrock/ (and region segments) from the model id, the builder sets custom_llm_provider to bedrock only when unset, so bare ids like us.cohere.embed-v4:0 still resolve while explicit bedrock_converse is preserved.
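The "pin only when unset" rule is small enough to sketch directly. The helper name `strip_bedrock_prefix` is hypothetical and the sketch handles only the `bedrock/` routing prefix, but the guard mirrors the `if not litellm_params.get('custom_llm_provider')` check quoted in the review.

```python
from typing import Any, Dict


def strip_bedrock_prefix(litellm_params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with the bedrock/ prefix removed and the provider pinned if blank."""
    params = dict(litellm_params)  # do not mutate the deployment's own config
    model = params.get("model", "")
    if model.startswith("bedrock/"):
        params["model"] = model[len("bedrock/"):]
        # Fill in the provider only when the deployment left it blank, so an
        # explicit value such as bedrock_converse is preserved.
        if not params.get("custom_llm_provider"):
            params["custom_llm_provider"] = "bedrock"
    return params
```

An unconditional assignment in place of the guard would reproduce the earlier revision's problem, where an explicit `bedrock_converse` was overwritten at health-check time.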

API typing: ahealth_check's mode is relaxed from a strict Literal to Optional[str] so resolved modes (e.g. embedding) flow through.

Tests cover Titan/Cohere embedding paths, explicit mode override, chat regression, provider pin vs preserve, and threading mode into ahealth_check.

Reviewed by Cursor Bugbot for commit 66692e2. Bugbot is set up for automated code reviews on this repo.

codecov · 2026-06-17T00:37:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

greptile-apps · 2026-06-17T02:02:04Z

Greptile Summary

This PR fixes two independent Bedrock embedding health-check failures: max_tokens being injected into embedding probes (causing 400 "extraneous key [max_tokens]") and cross-region inference-profile ids losing their provider after prefix stripping (causing "LLM Provider NOT provided").

Adds _resolve_health_check_mode() to auto-detect a deployment's effective mode from the model cost map (which understands bedrock/ and us./eu./apac. prefixes) when no explicit model_info.mode is set; the resolved mode now gates max_tokens, reasoning_effort, and voice injection.
After stripping the bedrock/ routing prefix, custom_llm_provider is pinned to "bedrock" only when the deployment hasn't already set one, so an explicit bedrock_converse survives untouched.
ahealth_check's mode parameter is widened from Literal[...] to str | None so auto-detected modes flow through to the correct handler.

Confidence Score: 5/5

Safe to merge — all three changed files are narrowly scoped to health-check paths, with no impact on the live request path.

The two bug fixes are well-contained: mode resolution falls back gracefully (returns None → treated as chat) for unknown models, the provider pin is correctly guarded with if not litellm_params.get('custom_llm_provider'), and every changed branch has a corresponding unit test. The previous reviewer concern about unconditional provider override is addressed. The ahealth_check signature widening is backward-compatible.

No files require special attention.

Important Files Changed

Filename	Overview
litellm/proxy/health_check.py	Adds _resolve_health_check_mode() to auto-detect embedding/speech modes from the model cost map; threads the resolved mode through max_tokens, reasoning_effort, and audio_speech guards; pins custom_llm_provider to "bedrock" only when unset after prefix stripping — correctly preserving an explicit bedrock_converse.
litellm/main.py	Widens the ahealth_check mode parameter from Literal[...] to str | None.
tests/test_litellm/proxy/test_health_check_max_tokens.py	Adds unit tests covering both bug fixes: no max_tokens/reasoning_effort injected into Bedrock embedding probes, provider pinned after prefix strip, explicit provider preserved, and mode correctly threaded to ahealth_check.

Reviews (3): Last reviewed commit: "fix(health): resolve probe mode once for..."

mateo-berri · 2026-06-17T02:04:18Z

@greptileai

mateo-berri · 2026-06-17T02:15:54Z

Addressed the bedrock_converse override (P1) in 5d1be34. The prefix-strip now fills in custom_llm_provider: bedrock only when the deployment left the provider blank, so a bare cross-region id like us.cohere.embed-v4:0 still resolves while an explicit bedrock_converse is left untouched and the probe keeps hitting the Converse endpoint. Added a regression test (test_bedrock_prefix_strip_preserves_explicit_custom_llm_provider) that fails on the old unconditional write.

The line 468/476 notes (P2; reasoning_effort and audio_speech reading model_info.get("mode") directly) are pre-existing behavior that this PR doesn't touch, and the reasoning_effort path only fires when an operator sets health_check_reasoning_effort on a non-chat deployment, which isn't a real config. I'm leaving those out to keep the PR scoped to the two embedding bugs, which lines up with the latest Greptile pass marking them out of scope.

Health checks for Bedrock embedding deployments failed in two ways. A deployment configured without an explicit model_info.mode was probed as chat, so max_tokens was injected and Bedrock embeddings rejected it with 400 "extraneous key [max_tokens]". Separately, stripping the bedrock/ routing prefix dropped the provider, so a cross-region inference-profile id like us.cohere.embed-v4:0 failed downstream with "LLM Provider NOT provided". Resolve the deployment mode from the model cost map (which understands the bedrock/ and us./eu./apac. prefixes) before deciding whether to inject max_tokens, and pin custom_llm_provider to bedrock when stripping the prefix so the bare model id still resolves. ahealth_check now accepts any string mode so the resolved embedding mode routes the probe to the embedding handler.

The bedrock prefix-strip pinned custom_llm_provider to bedrock unconditionally, so a deployment that set custom_llm_provider: bedrock_converse had it overwritten at health-check time and the probe hit the Invoke endpoint instead of Converse, a different request format that can report a spurious failure. Only fill in bedrock when the deployment left the provider blank, which still resolves bare cross-region ids like us.cohere.embed-v4:0 while leaving an explicit provider untouched.

The existing tests check _resolve_health_check_mode and the params builder in isolation, but nothing verified that _run_model_health_check actually threads the resolved mode into litellm.ahealth_check. Without that, a refactor that probed with model_info.get("mode") again would reintroduce the chat fallback for embedding deployments while every test stayed green. This drives _run_model_health_check with a bedrock embedding deployment and asserts the probe is called with mode=embedding and the embedding params.
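A test of that shape might look like the sketch below. It does not reproduce the PR's test: `run_model_health_check` and `resolve_mode` are simplified hypothetical stand-ins, and the probe is an `AsyncMock` rather than litellm.ahealth_check.

```python
import asyncio
from typing import Any, Dict, Optional
from unittest.mock import AsyncMock


def resolve_mode(model_info: Dict[str, Any], model: str) -> Optional[str]:
    # Hypothetical stand-in: explicit mode first, then a crude name check
    # in place of a real cost-map lookup.
    return model_info.get("mode") or ("embedding" if "embed" in model else None)


async def run_model_health_check(deployment: Dict[str, Any], probe: Any) -> Any:
    params = dict(deployment["litellm_params"])
    mode = resolve_mode(deployment.get("model_info", {}), params["model"])
    # The point under test: the *resolved* mode is what reaches the probe.
    return await probe(**params, mode=mode)


def test_mode_is_threaded_into_probe() -> None:
    probe = AsyncMock(return_value={})
    deployment = {
        "litellm_params": {"model": "bedrock/amazon.titan-embed-text-v2:0"},
        "model_info": {},  # no explicit mode, the case the PR targets
    }
    asyncio.run(run_model_health_check(deployment, probe))
    kwargs = probe.await_args.kwargs
    assert kwargs["mode"] == "embedding"
    assert "max_tokens" not in kwargs
```

If `run_model_health_check` were changed to pass `model_info.get("mode")` instead of the resolved value, the first assertion would fail, which is the regression this kind of test is meant to catch.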

…peech The reasoning_effort and audio_speech branches read model_info.mode directly, so an embedding deployment declared without an explicit mode (the case this PR targets) was still treated as chat-like: a configured health_check_reasoning_effort got injected into the embedding probe, which embeddings reject as an unknown field, and an auto-detected audio_speech deployment never had its voice set. Resolve the effective mode once from the cost map and reuse it for the max_tokens, reasoning_effort, and audio_speech decisions so they all agree with the mode threaded into ahealth_check.

mateo-berri · 2026-06-17T02:41:13Z

@greptileai

mateo-berri · 2026-06-17T02:48:14Z

bugbot run

cursor

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit 66692e2.

mateo-berri · 2026-06-17T04:02:22Z

Manual QA

Before

After

Sameerlite · 2026-06-17T13:28:37Z

-            "ocr",
-        ]
-    ] = "chat",
+    mode: str | None = "chat",


why would we remove the literal from this?

ahealth_check only uses mode as a lookup key into its handler dict and already validates it at runtime (it raises "Mode X not supported" for misses), and it even reassigns mode from the cost map internally, so the Literal was stricter than the function's own behavior. On top of that, the resolved mode also drives the max_tokens/reasoning_effort/voice decisions, which need the raw string to recognize non-chat modes. If we narrowed it to the literal union, a mode like moderation would fail validation, collapse to None, get treated as chat, and the max_tokens 400 bug would come back. Widening to str | None just makes the signature match what the function actually does; keeping the literal would require a cast that asserts a type the runtime values don't really satisfy.
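The "lookup key plus runtime validation" pattern described in this reply can be sketched as follows. The handler names and the two-entry `HANDLERS` dict are hypothetical; only the None-falls-back-to-chat behavior and the "Mode X not supported" error are taken from the thread.

```python
from typing import Callable, Dict, Optional


def probe_chat() -> str:
    return "chat probe"


def probe_embedding() -> str:
    return "embedding probe"


# Hypothetical handler table keyed by mode string.
HANDLERS: Dict[str, Callable[[], str]] = {
    "chat": probe_chat,
    "embedding": probe_embedding,
}


def pick_handler(mode: Optional[str] = "chat") -> Callable[[], str]:
    """Select a probe by mode; validation happens at runtime, not in the type."""
    key = mode or "chat"  # None falls back to chat
    if key not in HANDLERS:
        raise ValueError(f"Mode {key} not supported")
    return HANDLERS[key]
```

Because the dict lookup is the real gate, a plain string annotation describes the function accurately, and callers holding a mode read from configuration or a cost map do not need a cast.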

mateo-berri force-pushed the litellm_fix_bedrock_embedding_health_checks branch 2 times, most recently from e9292d0 to 6696899 June 17, 2026 00:49

mateo-berri marked this pull request as ready for review June 17, 2026 01:59

greptile-apps Bot reviewed Jun 17, 2026

Comment thread litellm/proxy/health_check.py Outdated

mateo-berri added 2 commits June 17, 2026 02:17

mateo-berri force-pushed the litellm_fix_bedrock_embedding_health_checks branch from 5d1be34 to 63174ac June 17, 2026 02:19

mateo-berri added 2 commits June 17, 2026 02:28

cursor Bot reviewed Jun 17, 2026

Sameerlite reviewed Jun 17, 2026

mateo-berri requested a review from yuneng-berri June 17, 2026 20:45

yuneng-berri approved these changes Jun 17, 2026

mateo-berri merged commit c51ba34 into litellm_internal_staging Jun 17, 2026
126 checks passed

mateo-berri deleted the litellm_fix_bedrock_embedding_health_checks branch June 17, 2026 21:34

mateo-berri merged 4 commits into litellm_internal_staging from litellm_fix_bedrock_embedding_health_checks

3 participants
