fix(health): correct bedrock embedding health checks #30583

Merged
mateo-berri merged 4 commits into litellm_internal_staging from litellm_fix_bedrock_embedding_health_checks
Jun 17, 2026

Conversation

@mateo-berri

@mateo-berri mateo-berri commented Jun 17, 2026

Collaborator

Relevant issues

Customer thread (ticket #4474) reporting two Bedrock embedding setup errors that have no PR yet. The Bedrock Mantle config/auth part of the same thread already shipped via #29490, #30083, #30163, #30426 and #29788; this PR covers the two embedding bugs that were left unaddressed.

Linear ticket

LIT-3747

Pre-Submission checklist

  • I have added meaningful tests
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible; it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

What was broken

A Bedrock embedding deployment configured the natural way, without an explicit model_info.mode, failed its health check for two independent reasons.

First, the health-check builder treated a missing mode as chat and injected max_tokens into the probe. Bedrock embeddings reject unknown fields, so bedrock/amazon.titan-embed-text-v2:0 came back 400 "Malformed input request: extraneous key [max_tokens]", and drop_params did not help because the param is injected by the health-check builder rather than mapped through provider drop logic. The only workaround was setting model_info: {mode: embedding} by hand.

Second, the Bedrock rewrite strips the bedrock/ routing prefix but did not record the provider, so a cross-region inference-profile id like bedrock/us.cohere.embed-v4:0 became the bare us.cohere.embed-v4:0 and ahealth_check's get_llm_provider raised litellm.BadRequestError: LLM Provider NOT provided (the us. prefix keeps it out of bedrock_embedding_models).

Changes

litellm/proxy/health_check.py now resolves a deployment's effective mode from the model cost map before deciding whether to inject max_tokens. litellm.get_model_info already understands the bedrock/ and us./eu./apac. prefixes, so titan and cohere both resolve to embedding and the probe no longer carries max_tokens. That same resolved mode now also gates the reasoning_effort and audio_speech voice injections, so an embedding deployment auto-detected without an explicit model_info.mode no longer picks up a configured reasoning_effort that the embeddings endpoint would reject. When the rewrite strips bedrock/, it pins custom_llm_provider to bedrock only when the deployment hasn't already set one, so a bare cross-region id still resolves to the provider while an explicit custom_llm_provider: bedrock_converse survives untouched. litellm.ahealth_check's mode parameter is widened from a strict Literal to Optional[str] (it only uses the value as a handler key and already tolerates arbitrary modes), so the resolved embedding mode flows through and routes the probe to the embedding handler.
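The mode resolution described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `COST_MAP`, `lookup_mode`, and `resolve_mode` are hypothetical names, and the small dictionary stands in for the model cost map so the sketch runs without litellm installed. The PR's own helper is `_resolve_health_check_mode`, which relies on `litellm.get_model_info` for the lookup.

```python
from typing import Optional

# Hypothetical stand-in for the model cost map; the real entries live in litellm.
COST_MAP = {
    "amazon.titan-embed-text-v2:0": {"mode": "embedding"},
    "cohere.embed-v4:0": {"mode": "embedding"},
}

# Prefixes the lookup is assumed to tolerate, mirroring the PR description.
ROUTING_PREFIX = "bedrock/"
REGION_PREFIXES = ("us.", "eu.", "apac.")


def lookup_mode(model: str) -> Optional[str]:
    """Look up a mode after dropping the routing and region prefixes."""
    if model.startswith(ROUTING_PREFIX):
        model = model[len(ROUTING_PREFIX):]
    for prefix in REGION_PREFIXES:
        if model.startswith(prefix):
            model = model[len(prefix):]
            break
    entry = COST_MAP.get(model)
    return entry["mode"] if entry else None


def resolve_mode(model: str, model_info: dict) -> Optional[str]:
    """An explicit model_info.mode wins; otherwise fall back to the lookup."""
    explicit = model_info.get("mode")
    if explicit:
        return explicit
    return lookup_mode(model)
```

Returning None for an unknown model keeps the old behavior available to the caller, which can continue to treat a missing mode as chat.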

Screenshots / Proof of Fix

I could not exercise this against live AWS Bedrock from the sandbox (no Bedrock credentials), so here is the runbook to capture the proof against a real account.

  1. Add to litellm/proxy/dev_config.yaml:
model_list:
  - model_name: titan-embed-text-v2
    litellm_params:
      model: bedrock/amazon.titan-embed-text-v2:0
      aws_region_name: us-east-1
  - model_name: cohere-embed
    litellm_params:
      model: bedrock/us.cohere.embed-v4:0
      aws_region_name: us-east-1
  2. Start the proxy:
python litellm/proxy/proxy_cli.py --config litellm/proxy/dev_config.yaml --detailed_debug --reload --use_v2_migration_resolver 2>&1 | tee litellm.log
  3. Hit the health endpoint and confirm both deployments are healthy (before this change they showed up under unhealthy_endpoints with the two errors above):
curl -s -H "Authorization: Bearer $LITELLM_MASTER_KEY" http://localhost:4000/health | jq '.healthy_endpoints, .unhealthy_endpoints'
  4. Or in the UI, go to http://localhost:4000/ui/?page=models, click "Health Status" / run a health check on both deployments, and confirm both are green.
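For anyone who prefers a script to curl and jq, the health-endpoint step could look something like the sketch below. It uses only the standard library. The healthy_endpoints and unhealthy_endpoints keys come from the jq filter in the runbook; the per-entry "model" key and the helper names `fetch_health` and `summarize_health` are assumptions made for illustration.

```python
import json
import urllib.request


def fetch_health(base_url: str, api_key: str) -> dict:
    """GET <base_url>/health with a bearer token and decode the JSON body."""
    request = urllib.request.Request(
        f"{base_url}/health",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def summarize_health(payload: dict) -> tuple[list[str], list[str]]:
    """Return (healthy, unhealthy) model ids; the "model" key is an assumption."""

    def models(key: str) -> list[str]:
        return [entry.get("model", "<unknown>") for entry in payload.get(key, [])]

    return models("healthy_endpoints"), models("unhealthy_endpoints")
```

With the proxy from the runbook running, `summarize_health(fetch_health("http://localhost:4000", os.environ["LITELLM_MASTER_KEY"]))` (after `import os`) should list both Bedrock deployments in the first element if the fix behaves as described.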

Type

🐛 Bug Fix


Note

Low Risk
Scoped to proxy health-check param building and a widened type annotation on ahealth_check; the behavior change is intentional for Bedrock embeddings, with a backward-compatible fallback when the mode is unknown.

Overview
Fixes proxy health checks for Bedrock embedding deployments that omit model_info.mode.

Mode resolution: Adds _resolve_health_check_mode, which uses model_info.mode when set, otherwise litellm.get_model_info (handles bedrock/ and cross-region us./eu./apac. ids). That resolved mode drives max_tokens injection, reasoning_effort, audio/voice handling, and the mode passed into ahealth_check—so embeddings are no longer probed as chat with max_tokens or stray reasoning_effort.

Bedrock routing: After stripping bedrock/ (and region segments) from the model id, the builder sets custom_llm_provider to bedrock only when unset, so bare ids like us.cohere.embed-v4:0 still resolve while explicit bedrock_converse is preserved.
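The "pin only when unset" rule can be illustrated with a short sketch. The function name `strip_bedrock_prefix` is hypothetical, and the handling of region segments mentioned above is left out for brevity; only the bedrock/ routing prefix and the provider guard are shown.

```python
def strip_bedrock_prefix(litellm_params: dict) -> dict:
    """Strip the bedrock/ routing prefix and pin the provider only if unset."""
    params = dict(litellm_params)  # copy so the caller's config is untouched
    model = params.get("model", "")
    if model.startswith("bedrock/"):
        params["model"] = model[len("bedrock/"):]
        # Fill in the provider only when the deployment left it blank, so an
        # explicit value such as bedrock_converse is preserved.
        if not params.get("custom_llm_provider"):
            params["custom_llm_provider"] = "bedrock"
    return params
```

The guard is the whole point: writing the provider unconditionally would silently change which endpoint an explicitly configured deployment is probed against.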

API typing: ahealth_check's mode is relaxed from a strict Literal to Optional[str] so resolved modes (e.g. embedding) flow through.

Tests cover Titan/Cohere embedding paths, explicit mode override, chat regression, provider pin vs preserve, and threading mode into ahealth_check.

Reviewed by Cursor Bugbot for commit 66692e2.

@codecov

codecov Bot commented Jun 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@mateo-berri mateo-berri force-pushed the litellm_fix_bedrock_embedding_health_checks branch 2 times, most recently from e9292d0 to 6696899 June 17, 2026 00:49
@mateo-berri mateo-berri marked this pull request as ready for review June 17, 2026 01:59
@greptile-apps

greptile-apps Bot commented Jun 17, 2026

Contributor

Greptile Summary

This PR fixes two independent Bedrock embedding health-check failures: max_tokens being injected into embedding probes (causing 400 "extraneous key [max_tokens]") and cross-region inference-profile ids losing their provider after prefix stripping (causing "LLM Provider NOT provided").

  • Adds _resolve_health_check_mode() to auto-detect a deployment's effective mode from the model cost map (which understands bedrock/ and us./eu./apac. prefixes) when no explicit model_info.mode is set; the resolved mode now gates max_tokens, reasoning_effort, and voice injection.
  • After stripping the bedrock/ routing prefix, custom_llm_provider is pinned to "bedrock" only when the deployment hasn't already set one, so an explicit bedrock_converse survives untouched.
  • ahealth_check's mode parameter is widened from Literal[...] to str | None so auto-detected modes flow through to the correct handler.

Confidence Score: 5/5

Safe to merge — all three changed files are narrowly scoped to health-check paths, with no impact on the live request path.

The two bug fixes are well-contained: mode resolution falls back gracefully (returns None → treated as chat) for unknown models, the provider pin is correctly guarded with if not litellm_params.get('custom_llm_provider'), and every changed branch has a corresponding unit test. The previous reviewer concern about unconditional provider override is addressed. The ahealth_check signature widening is backward-compatible.

No files require special attention.

Important Files Changed

  • litellm/proxy/health_check.py: Adds _resolve_health_check_mode() to auto-detect embedding/speech modes from the model cost map; threads the resolved mode through the max_tokens, reasoning_effort, and audio_speech guards; pins custom_llm_provider to "bedrock" only when unset after prefix stripping, correctly preserving an explicit bedrock_converse.
  • litellm/main.py: Widens the ahealth_check mode parameter from Literal[...] to str | None.
  • tests/test_litellm/proxy/test_health_check_max_tokens.py: Adds unit tests covering both bug fixes: no max_tokens/reasoning_effort injected into Bedrock embedding probes, provider pinned after prefix strip, explicit provider preserved, and mode correctly threaded to ahealth_check.

Reviews (3): Last reviewed commit: "fix(health): resolve probe mode once for..." | Re-trigger Greptile

Comment thread litellm/proxy/health_check.py Outdated
@mateo-berri

Collaborator Author

@greptileai

@mateo-berri

mateo-berri commented Jun 17, 2026

Collaborator Author

Addressed the bedrock_converse override (P1) in 5d1be34. The prefix-strip now fills in custom_llm_provider: bedrock only when the deployment left the provider blank, so a bare cross-region id like us.cohere.embed-v4:0 still resolves while an explicit bedrock_converse is left untouched and the probe keeps hitting the Converse endpoint. Added a regression test (test_bedrock_prefix_strip_preserves_explicit_custom_llm_provider) that fails on the old unconditional write.

The line 468/476 notes (P2; reasoning_effort and audio_speech reading model_info.get("mode") directly) describe pre-existing behavior that this PR doesn't touch, and the reasoning_effort path only fires when an operator sets health_check_reasoning_effort on a non-chat deployment, which isn't a real config. I'm leaving those out to keep the PR scoped to the two embedding bugs, which lines up with the latest Greptile pass marking them out of scope.

Health checks for Bedrock embedding deployments failed in two ways. A
deployment configured without an explicit model_info.mode was probed as
chat, so max_tokens was injected and Bedrock embeddings rejected it with
400 "extraneous key [max_tokens]". Separately, stripping the bedrock/
routing prefix dropped the provider, so a cross-region inference-profile
id like us.cohere.embed-v4:0 failed downstream with "LLM Provider NOT
provided".

Resolve the deployment mode from the model cost map (which understands
the bedrock/ and us./eu./apac. prefixes) before deciding whether to
inject max_tokens, and pin custom_llm_provider to bedrock when stripping
the prefix so the bare model id still resolves. ahealth_check now accepts
any string mode so the resolved embedding mode routes the probe to the
embedding handler.

The bedrock prefix-strip pinned custom_llm_provider to bedrock
unconditionally, so a deployment that set custom_llm_provider:
bedrock_converse had it overwritten at health-check time and the probe
hit the Invoke endpoint instead of Converse, a different request format
that can report a spurious failure. Only fill in bedrock when the
deployment left the provider blank, which still resolves bare
cross-region ids like us.cohere.embed-v4:0 while leaving an explicit
provider untouched.

@mateo-berri mateo-berri force-pushed the litellm_fix_bedrock_embedding_health_checks branch from 5d1be34 to 63174ac June 17, 2026 02:19

The existing tests check _resolve_health_check_mode and the params builder
in isolation, but nothing verified that _run_model_health_check actually
threads the resolved mode into litellm.ahealth_check. Without that, a
refactor that probed with model_info.get("mode") again would reintroduce
the chat fallback for embedding deployments while every test stayed green.
This drives _run_model_health_check with a bedrock embedding deployment and
asserts the probe is called with mode=embedding and the embedding params.
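
The shape of such a test can be sketched with a mocked probe. Everything here is a stand-in: `run_health_check` is a hypothetical runner rather than the real `_run_model_health_check`, the `model_params` and `mode` keyword names are assumptions about how the probe is called, and the resolver is a lambda instead of a cost-map lookup.

```python
import asyncio
from unittest.mock import AsyncMock


async def run_health_check(deployment: dict, probe, resolve_mode) -> dict:
    """Hypothetical runner: resolve the mode once and pass it to the probe."""
    params = dict(deployment["litellm_params"])
    mode = resolve_mode(params["model"], deployment.get("model_info", {}))
    return await probe(model_params=params, mode=mode)


def check_mode_is_threaded() -> None:
    probe = AsyncMock(return_value={"status": "healthy"})
    deployment = {
        "litellm_params": {"model": "bedrock/amazon.titan-embed-text-v2:0"},
        "model_info": {},  # no explicit mode, the case the PR targets
    }
    # Stand-in resolver; the real one consults the model cost map.
    result = asyncio.run(
        run_health_check(deployment, probe, lambda model, info: "embedding")
    )
    assert result == {"status": "healthy"}
    # The probe must see the resolved mode, not a chat fallback.
    assert probe.await_args.kwargs["mode"] == "embedding"
    assert "max_tokens" not in probe.await_args.kwargs["model_params"]
```

Asserting on the arguments the probe actually received is what catches a regression where the runner goes back to reading model_info.get("mode") directly.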
…peech

The reasoning_effort and audio_speech branches read model_info.mode
directly, so an embedding deployment declared without an explicit mode (the
case this PR targets) was still treated as chat-like: a configured
health_check_reasoning_effort got injected into the embedding probe, which
embeddings reject as an unknown field, and an auto-detected audio_speech
deployment never had its voice set. Resolve the effective mode once from the
cost map and reuse it for the max_tokens, reasoning_effort, and audio_speech
decisions so they all agree with the mode threaded into ahealth_check.
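
A rough sketch of "resolve once, reuse everywhere" follows. The helper `build_probe_params`, its defaults, and the exact gating conditions are assumptions chosen for illustration; the only behavior taken from the text is that a missing mode is treated as chat, that embeddings must not receive max_tokens or reasoning_effort, and that an audio_speech deployment gets its voice set.

```python
from typing import Optional


def build_probe_params(
    litellm_params: dict,
    mode: Optional[str],
    max_tokens: int = 1,
    reasoning_effort: Optional[str] = None,
    voice: Optional[str] = None,
) -> dict:
    """Apply all three injections from one resolved mode (hypothetical helper)."""
    params = dict(litellm_params)
    chat_like = mode is None or mode == "chat"  # unknown mode falls back to chat
    if chat_like:
        params["max_tokens"] = max_tokens
        if reasoning_effort is not None:
            params["reasoning_effort"] = reasoning_effort
    elif mode == "audio_speech" and voice is not None:
        params["voice"] = voice
    return params
```

Because every branch reads the same `mode` argument, the three decisions cannot disagree with each other or with the mode handed to the probe.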
@mateo-berri

Collaborator Author

@greptileai

@mateo-berri

Collaborator Author

bugbot run

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit 66692e2.

@mateo-berri

Collaborator Author

Manual QA

Before

image

After

image

Comment thread litellm/main.py
"ocr",
]
] = "chat",
mode: str | None = "chat",

Collaborator

why would we remove the literal from this?

Collaborator Author

ahealth_check only uses mode as a lookup key into its handler dict and already validates it at runtime (it raises "Mode X not supported" for misses), and it even reassigns mode from the cost map internally, so the Literal was stricter than the function's own behavior. On top of that, the resolved mode also drives the max_tokens/reasoning_effort/voice decisions, which need the raw string to recognize non-chat modes. If we narrowed it to the literal union, a mode like moderation would fail validation, collapse to None, get treated as chat, and the max_tokens 400 bug would come back. Widening to str | None just makes the signature match what the function actually does; keeping the Literal would require a cast that asserts a type the runtime values don't really satisfy.
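
The lookup-key pattern being described can be sketched in a few lines. This is not ahealth_check itself: `HANDLERS`, `dispatch`, and the two probe functions are hypothetical, the fallback of a missing mode to chat is an assumption, and ValueError is a stand-in for whatever exception the real function raises with its "Mode X not supported" message.

```python
from typing import Callable, Dict, Optional


def probe_chat() -> str:
    return "chat probe"


def probe_embedding() -> str:
    return "embedding probe"


# The mode is only ever used as a key into this table.
HANDLERS: Dict[str, Callable[[], str]] = {
    "chat": probe_chat,
    "embedding": probe_embedding,
}


def dispatch(mode: Optional[str] = "chat") -> str:
    """Treat mode as a plain lookup key and validate it at call time."""
    key = mode or "chat"  # assumption: a missing mode falls back to chat
    handler = HANDLERS.get(key)
    if handler is None:
        raise ValueError(f"Mode {key} not supported")
    return handler()
```

Since the table lookup already rejects unsupported values at runtime, an annotation of `Optional[str]` describes the function's real contract without losing any safety the Literal provided in practice.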

@mateo-berri mateo-berri requested a review from yuneng-berri June 17, 2026 20:45
@mateo-berri mateo-berri merged commit c51ba34 into litellm_internal_staging Jun 17, 2026
126 checks passed
@mateo-berri mateo-berri deleted the litellm_fix_bedrock_embedding_health_checks branch June 17, 2026 21:34
3 participants