fix(advisor): route through proxy, render native blocks, and attribute per-model usage by samagana · Pull Request #30546 · BerriAI/litellm

samagana · 2026-06-16T17:57:17Z

Relevant issues

Upstream advisor rollout: #25516. Ported from scaledata/litellm PRs #4, #7, #8

Pre-Submission checklist

I have added meaningful tests
My PR passes all unit tests on make test-unit (the touched/related suites pass; 32 tests across 4 test files)
My PR's scope is as isolated as possible; it only solves 1 specific problem
I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

What and why

Three related fixes to the advisor orchestration loop for non-Anthropic executor providers, combined into a single PR since they form a coherent stack (each builds on the prior). Follow-up commits address all review findings from Greptile and Veria-AI.

1. Route sub-calls through the proxy router (scaledata/litellm#4)

Routes advisor and executor sub-calls through the proxy's llm_router in-process when available (falling back to direct anthropic_messages() for standalone SDK use), instead of requiring ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL pointed at localhost with the master key.
Stops passing api_key=None/api_base=None to the advisor sub-call, which would override the router's resolved deployment credentials during kwargs merging.
Forwards the caller's litellm_metadata into the advisor sub-call so the advisor model's spend is attributed to the caller's key and budget.
Gives the advisor sub-call its own role system prompt so it answers as the advisor rather than adopting the executor's persona.
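The credential-override bug comes down to how keyword arguments are merged. A minimal sketch, assuming the router combines deployment parameters and call kwargs with a plain dict merge in which call kwargs win; `merge_call_kwargs`, `drop_unset`, and the sample values are hypothetical and are not litellm's actual internals:

```python
from typing import Any, Dict, Optional


def merge_call_kwargs(deployment: Dict[str, Any], call_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical merge: explicit call kwargs win over deployment params."""
    return {**deployment, **call_kwargs}


def drop_unset(**kwargs: Optional[Any]) -> Dict[str, Any]:
    """Keep only the kwargs that were explicitly set (not None)."""
    return {key: value for key, value in kwargs.items() if value is not None}


# Illustrative deployment as resolved from the proxy config.
deployment = {"model": "anthropic/claude-opus-4-8", "api_key": "sk-from-proxy-config"}

# Before: api_key=None is always passed and clobbers the resolved credential.
broken = merge_call_kwargs(deployment, {"api_key": None, "api_base": None})

# After: unset credentials are omitted, so the deployment's values survive.
fixed = merge_call_kwargs(deployment, drop_unset(api_key=None, api_base=None))
```

Omitting the keys entirely, rather than passing `None`, is what lets the deployment's own values through under this kind of merge.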

2. Render advisor activity in clients (scaledata/litellm#7)

Through the proxy, a non-Anthropic executor with an advisor_20260301 tool was orchestrated entirely in litellm: the executor/advisor loop ran in-process and only the final, flattened executor response was returned. Claude Code drives its advisor UI from server_tool_use (name advisor) and advisor_tool_result content blocks; when the executor is non-Anthropic the server cannot run that loop, so this change has our orchestrator synthesize the same blocks, making the proxy path render identically to the native one. For streaming, the server_tool_use block is flushed before the advisor sub-call is awaited, so the advisor renders as in-progress for its real latency and resolves when the result arrives.
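The flush-before-await ordering can be sketched with an async generator. This is an illustration only: `fake_advisor_call`, `advisor_exchange`, the `id`/`tool_use_id` field names, and the sample ID are assumptions, and the real orchestrator writes SSE frames rather than collecting block types.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List


async def fake_advisor_call(question: str) -> str:
    """Stand-in for the advisor sub-call; sleeps to simulate latency."""
    await asyncio.sleep(0.01)
    return f"advice for: {question}"


async def advisor_exchange(tool_use_id: str, question: str) -> AsyncIterator[Dict[str, Any]]:
    # Yield the server_tool_use block first so the consumer can flush it
    # to the client before the advisor call is awaited.
    yield {"type": "server_tool_use", "id": tool_use_id, "name": "advisor", "input": {}}
    advice = await fake_advisor_call(question)
    # Only once the advisor has answered does the result block follow.
    yield {
        "type": "advisor_tool_result",
        "tool_use_id": tool_use_id,
        "content": {"type": "advisor_result", "text": advice},
    }


async def demo() -> List[str]:
    seen: List[str] = []
    async for block in advisor_exchange("srvtoolu_1", "debounce in React?"):
        seen.append(block["type"])  # a real consumer would write SSE here
    return seen


order = asyncio.run(demo())
```

Because the generator suspends at the first `yield`, the client sees the in-progress block for the full duration of the advisor call.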

Also fixes strip_advisor_blocks_from_messages to recover advice text from the native {"type":"advisor_result","text":...} content shape (previously handled only str/list content and silently dropped the dict's text, breaking the multi-turn round-trip).
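A rough sketch of handling all three content shapes, for illustration. `advisor_result_text` here is a hypothetical stand-in, not the PR's `_advisor_result_text`; joining list parts with a newline and filtering on `type == "text"` are assumptions.

```python
from typing import Any


def advisor_result_text(content: Any) -> str:
    """Extract advice text from str, list-of-blocks, or dict content."""
    if isinstance(content, str):
        return content
    if isinstance(content, dict):
        # Native shape: {"type": "advisor_result", "text": "..."}
        text = content.get("text")
        return text if isinstance(text, str) else ""
    if isinstance(content, list):
        # Keep only text blocks; other block types carry no advice text.
        parts = [
            block["text"]
            for block in content
            if isinstance(block, dict)
            and block.get("type") == "text"
            and isinstance(block.get("text"), str)
        ]
        return "\n".join(parts)
    return ""
```

The dict branch is the case that was previously dropped; the other two branches mirror the str/list handling the description says already existed.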

Also extracts the shared build_content_block_chunks helper in fake_stream_iterator.py and teaches it the server_tool_use and advisor_tool_result block types (previously unknown types were silently dropped from the stream).

3. Attribute advisor spend to its own model (scaledata/litellm#8)

When an advisor consult ran through the proxy, Claude Code's /usage attributed all spend to the executor model and showed nothing for the advisor model. Claude Code computes per-model usage from usage.iterations[] entries of type:"advisor_message". This change emits usage.iterations[] from the orchestrator: one advisor_message entry per advisor sub-call, carrying the advisor model and its token counts, on both the non-streaming response and the streaming message_delta.
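As an illustration of the emitted shape, the sketch below attaches one `advisor_message` entry per advisor sub-call to the executor's usage. The helper names, the exact entry fields beyond `type` and `model`, and the token numbers are assumptions; only advisor entries are built here.

```python
from typing import Any, Dict, List, Tuple


def advisor_iteration(model: str, usage: Dict[str, int]) -> Dict[str, Any]:
    """Build one usage.iterations[] entry for an advisor sub-call."""
    return {
        "type": "advisor_message",
        "model": model,
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
    }


def with_iterations(
    executor_usage: Dict[str, Any],
    advisor_calls: List[Tuple[str, Dict[str, int]]],
) -> Dict[str, Any]:
    """Return executor usage plus an iterations list, without mutating the input."""
    iterations = [advisor_iteration(model, usage) for model, usage in advisor_calls]
    return {**executor_usage, "iterations": iterations}


usage = with_iterations(
    {"input_tokens": 120, "output_tokens": 40},
    [("opus-advisor", {"input_tokens": 300, "output_tokens": 90})],
)
```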

4. Review feedback fixes

Addresses all automated review findings:

Model authorization bypass (Veria-AI, High): validates advisor_model against the caller's UserAPIKeyAuth via can_key_call_model before entering the loop, following the same pattern as the MCP sampling handler. Skipped in standalone SDK mode (no auth layer)
Duplicated _sse helper (Greptile): removed the copy from advisor.py, now imports from fake_stream_iterator.py
message_start.usage zero tokens (Greptile): the streaming path now peeks the first loop event to extract the executor's real input_tokens before emitting message_start, matching the Anthropic SSE protocol and FakeAnthropicMessagesStreamIterator
Orphan content_block_stop (Greptile): build_content_block_chunks now only emits content_block_stop when a content_block_start was emitted, so unknown block types produce no chunks instead of a protocol-violating orphan stop
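The orphan-stop fix can be sketched as follows. This is a simplified stand-in for `build_content_block_chunks`, not its real signature: `sse`, `block_chunks`, and `KNOWN_BLOCK_TYPES` are hypothetical, and block types that stream deltas (such as text) are left out for brevity.

```python
import json
from typing import Any, Dict, List

# Only the two block types added by this PR; delta-streaming types are omitted.
KNOWN_BLOCK_TYPES = {"server_tool_use", "advisor_tool_result"}


def sse(event: str, data: Dict[str, Any]) -> str:
    """Format one server-sent event frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def block_chunks(index: int, block: Dict[str, Any]) -> List[str]:
    """Emit start/stop frames for a known block; emit nothing for unknown types."""
    chunks: List[str] = []
    if block.get("type") in KNOWN_BLOCK_TYPES:
        chunks.append(
            sse("content_block_start", {"type": "content_block_start", "index": index, "content_block": block})
        )
        # The stop is emitted only because a start was emitted just above.
        chunks.append(sse("content_block_stop", {"type": "content_block_stop", "index": index}))
    return chunks


known = block_chunks(0, {"type": "server_tool_use", "id": "srvtoolu_1", "name": "advisor", "input": {}})
unknown = block_chunks(1, {"type": "mystery_block"})
```

Tying the stop to the start in one branch makes an orphan `content_block_stop` impossible by construction.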

CI (LiteLLM team)

Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:

Screenshots / Proof of Fix

Run a proxy with a non-Anthropic executor and an Anthropic advisor, then drive the advisor through the streaming /v1/messages passthrough hitting real providers:

# dev_config.yaml
model_list:
  - model_name: gemini-flash
    litellm_params:
      model: gemini/gemini-2.5-flash
      api_key: os.environ/GEMINI_API_KEY
  - model_name: opus-advisor
    litellm_params:
      model: anthropic/claude-opus-4-8
      api_key: os.environ/ANTHROPIC_API_KEY

python litellm/proxy/proxy_cli.py --config litellm/proxy/dev_config.yaml --detailed_debug --reload --use_v2_migration_resolver 2>&1 | tee litellm.log

curl -N http://localhost:4000/v1/messages \
  -H "content-type: application/json" -H "x-api-key: $LITELLM_MASTER_KEY" \
  -H "anthropic-beta: advisor-tool-2026-03-01" \
  -d '{"model":"gemini-flash","max_tokens":1024,"stream":true,
       "tools":[{"type":"advisor_20260301","name":"advisor","model":"opus-advisor"}],
       "messages":[{"role":"user","content":"Before answering, consult your advisor: cleanest way to debounce in React? Then summarize."}]}'

Expected SSE order: content_block_start for server_tool_use (name advisor), a pause for the advisor latency, then content_block_start for advisor_tool_result, then the executor text. The message_start carries real input_tokens from the executor, and the message_delta carries usage.iterations[] with the advisor model's token counts. In Claude Code pointed at the proxy with /advisor opus-advisor, the advisor dot shows in-progress then resolves, and /usage shows a separate cost line for the advisor model.
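To check the block order in a captured stream, something like the following sketch may help. It assumes each SSE `data:` line holds one JSON object, as in Anthropic-style streams; `content_block_order` and the sample capture are hypothetical.

```python
import json
from typing import List


def content_block_order(sse_text: str) -> List[str]:
    """Return the content_block type of each content_block_start, in order."""
    order: List[str] = []
    for line in sse_text.splitlines():
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip keep-alives or non-JSON payloads
        if isinstance(event, dict) and event.get("type") == "content_block_start":
            order.append(event.get("content_block", {}).get("type", ""))
    return order


# Hand-written sample, not real proxy output.
sample = (
    'event: content_block_start\n'
    'data: {"type": "content_block_start", "index": 0, "content_block": {"type": "server_tool_use", "name": "advisor"}}\n\n'
    'event: content_block_start\n'
    'data: {"type": "content_block_start", "index": 1, "content_block": {"type": "advisor_tool_result"}}\n\n'
    'event: content_block_start\n'
    'data: {"type": "content_block_start", "index": 2, "content_block": {"type": "text", "text": ""}}\n\n'
)
observed = content_block_order(sample)
```

Saving the curl output to a file and passing its contents to `content_block_order` would give the same list for a real run, if the stream matches the expected order.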

Type

🐛 Bug Fix

Changes

See the "What and why" section above

…ng env vars

The advisor orchestration handler called anthropic_messages() directly for sub-calls, which required ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL to be set. In a proxy setup, the workaround was pointing those env vars at localhost with the master key, creating a pointless HTTP round-trip.

Now _call_messages_handler checks for the proxy's llm_router and routes through it when available. The router resolves model deployments and credentials from the proxy config in-process. Falls back to the direct anthropic_messages() call for standalone SDK usage (no proxy).

Also fixes a bug where api_key=None / api_base=None were always passed to the advisor sub-call, which would override the router's deployment credentials during kwargs merging. Now only passes them when explicitly set in the advisor tool definition.

The advisor sub-call routed through the proxy router but carried none of the caller's litellm_metadata (user_api_key/team/budget/session), so the advisor model's spend was never attributed to the caller's key or budget and the call did not group under the session in the UI. Forward litellm_metadata into the advisor leg so it is tracked exactly like the executor leg. Forward only litellm_metadata, not the executor's generation params: feeding the advisor the executor's agent system prompt (and tool_choice) made it mimic the executor and echo the advisor call instead of answering the question. api_key/api_base still come from the advisor tool definition only. Regression test asserts the advisor leg receives litellm_metadata but not the executor's system/tool_choice.

In our orchestration the advisor is handed a plain /v1/messages request rather than Anthropic's native server-side framing. With no role of its own it adopted the executor's persona from the forwarded conversation and refused or punted ("there's no separate advisor I can query, I'm answering directly"), which read to the executor as the advisor kicking the task back. Add ADVISOR_SYSTEM_PROMPT and pass it as the advisor leg's system prompt so the advisor answers as the advisor. The executor's own system prompt is still not forwarded, and the forwarded conversation context is unchanged; this only adds the role. A regression test asserts the advisor leg carries the advisor role prompt and not the executor's.
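Putting the last two commits together, the advisor leg's request might be assembled roughly like this. It is a sketch under assumptions: `build_advisor_request` is a hypothetical helper, the prompt text is a placeholder rather than the real ADVISOR_SYSTEM_PROMPT, and the request is modeled as a plain dict.

```python
from typing import Any, Dict

# Placeholder text; the real constant lives in litellm/constants.py.
ADVISOR_SYSTEM_PROMPT = "You are the advisor. Answer the question you are consulted on."


def build_advisor_request(executor_request: Dict[str, Any], advisor_tool: Dict[str, Any]) -> Dict[str, Any]:
    """Build the advisor sub-call without leaking executor-only settings."""
    request: Dict[str, Any] = {
        "model": advisor_tool["model"],
        # Conversation context is forwarded unchanged.
        "messages": list(executor_request["messages"]),
        # The advisor gets its own role; the executor's system prompt is not copied.
        "system": ADVISOR_SYSTEM_PROMPT,
    }
    # Metadata is forwarded so spend lands on the caller's key and budget.
    if "litellm_metadata" in executor_request:
        request["litellm_metadata"] = executor_request["litellm_metadata"]
    # Credentials come from the advisor tool definition only, and only when set.
    for key in ("api_key", "api_base"):
        if advisor_tool.get(key) is not None:
            request[key] = advisor_tool[key]
    return request


advisor_request = build_advisor_request(
    {
        "model": "gemini-flash",
        "system": "You are a coding agent.",
        "tool_choice": {"type": "auto"},
        "messages": [{"role": "user", "content": "Cleanest way to debounce in React?"}],
        "litellm_metadata": {"user_api_key": "hashed-key"},
    },
    {"type": "advisor_20260301", "name": "advisor", "model": "opus-advisor"},
)
```

Building the request from an explicit allow-list of fields, rather than copying the executor's kwargs and deleting some, keeps new executor parameters from leaking into the advisor leg by default.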

The rebase conflict resolution accidentally stripped comments from upstream's test_named_params_forwarded_into_advisor_executor_subcall and test_pre_request_hook_override_does_not_collide_with_explicit_kwargs. Restoring them so the PR diff only adds new tests without modifying upstream code.

…dvisor blocks

The non-native advisor orchestrator flattened the executor/advisor loop into a single final response, so Claude Code (and any client) saw a plain message with no signal that an advisor was consulted. Surface each advisor exchange as the native server_tool_use (name "advisor") and advisor_tool_result blocks that clients key their advisor UI on.

Streaming now runs a real orchestrator: the server_tool_use block is flushed before the advisor sub-call is awaited, so the advisor renders as in-progress for its real latency and resolves when the result arrives, matching the native server-side experience. Non-streaming and streaming share one _run_loop generator so the two paths cannot diverge.

Also fix strip_advisor_blocks_from_messages to recover advice text from the native {"type":"advisor_result","text":...} content shape; it previously handled only str/list content and silently dropped the dict's text, which broke the multi-turn round-trip once we started emitting these blocks.
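The "one generator, two consumers" design can be illustrated as below. The event shapes and the names `run_loop`, `collect`, `stream`, and `drain` are assumptions for the sketch; the PR's `_run_loop`, `_collect`, and `_stream` may differ in detail, and the SSE payload here is simply the raw internal event.

```python
import asyncio
import json
from typing import Any, AsyncIterator, Dict, List


async def run_loop() -> AsyncIterator[Dict[str, Any]]:
    """Single source of truth: yields semantic events for one orchestration."""
    yield {"kind": "block", "block": {"type": "server_tool_use", "name": "advisor"}}
    yield {"kind": "block", "block": {"type": "advisor_tool_result"}}
    yield {"kind": "block", "block": {"type": "text", "text": "Use a debounced callback."}}
    yield {"kind": "done", "stop_reason": "end_turn"}


async def collect() -> Dict[str, Any]:
    """Non-streaming consumer: fold events into one response."""
    content: List[Dict[str, Any]] = []
    stop_reason = None
    async for event in run_loop():
        if event["kind"] == "block":
            content.append(event["block"])
        else:
            stop_reason = event["stop_reason"]
    return {"role": "assistant", "content": content, "stop_reason": stop_reason}


async def stream() -> AsyncIterator[str]:
    """Streaming consumer: turn each event into an SSE frame as it arrives."""
    async for event in run_loop():
        name = "content_block_start" if event["kind"] == "block" else "message_delta"
        yield f"event: {name}\ndata: {json.dumps(event)}\n\n"


async def drain() -> List[str]:
    return [frame async for frame in stream()]


response = asyncio.run(collect())
frames = asyncio.run(drain())
```

Since both consumers iterate the same generator function, any change to the loop's behavior shows up in both paths at once.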

…ations[]

Clients such as Claude Code compute per-model usage from two sources: the top-level usage is attributed to the executor model, and each usage.iterations[] entry of type "advisor_message" is attributed to that entry's own model. Our orchestrator returned only the final executor usage with no iterations array, so the advisor model's tokens folded into the executor's line and its cost showed as zero.

Emit usage.iterations[] from the orchestrator: one advisor_message entry per advisor sub-call carrying the advisor model and its token counts, on both the non-streaming response and the streaming message_delta. The list ends with the executor's final turn so the client's context-window readout, which reads the last iteration, stays correct.
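The attribution rule described above can be sketched from the client's side. This only illustrates the stated rule and is not Claude Code's actual implementation; `usage_by_model`, the `"message"` type used for the trailing executor entry, and the token numbers are assumptions.

```python
from collections import defaultdict
from typing import Any, Dict


def usage_by_model(executor_model: str, usage: Dict[str, Any]) -> Dict[str, Dict[str, int]]:
    """Split token counts per model: top-level to executor, advisor entries to their own model."""
    totals: Dict[str, Dict[str, int]] = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    # Top-level usage is attributed to the executor model.
    totals[executor_model]["input_tokens"] += usage.get("input_tokens", 0)
    totals[executor_model]["output_tokens"] += usage.get("output_tokens", 0)
    for entry in usage.get("iterations", []):
        if entry.get("type") != "advisor_message":
            continue  # assumed: non-advisor entries are covered by the top-level usage
        model = entry["model"]
        totals[model]["input_tokens"] += entry.get("input_tokens", 0)
        totals[model]["output_tokens"] += entry.get("output_tokens", 0)
    return dict(totals)


per_model = usage_by_model(
    "gemini-flash",
    {
        "input_tokens": 120,
        "output_tokens": 40,
        "iterations": [
            {"type": "advisor_message", "model": "opus-advisor", "input_tokens": 300, "output_tokens": 90},
            {"type": "message", "input_tokens": 120, "output_tokens": 40},
        ],
    },
)
```

Without the `iterations` array, the same function would report only the executor line, which matches the symptom the commit describes.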

codecov · 2026-06-16T18:01:37Z

Codecov Report

❌ Patch coverage is 88.46154% with 18 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...ntal_pass_through/messages/fake_stream_iterator.py	70.37%	8 Missing ⚠️
...ntal_pass_through/messages/interceptors/advisor.py	93.91%	7 Missing ⚠️
litellm/llms/anthropic/common_utils.py	66.66%	3 Missing ⚠️


greptile-apps · 2026-06-16T18:06:00Z

Greptile Summary

This PR delivers three coherent fixes to the advisor orchestration loop for non-Anthropic executor providers: routing sub-calls through the proxy's llm_router, surfacing advisor activity as native server_tool_use/advisor_tool_result SSE blocks so clients render the in-progress state, and emitting usage.iterations[] so Claude Code attributes advisor spend to the correct model. A fourth commit adds model-access validation via can_key_call_model and closes several review findings from prior rounds.

Router routing & credential isolation (advisor.py): sub-calls now go through llm_router when available; advisor_kwargs is built explicitly to avoid leaking None credentials or the executor's system prompt/tool_choice into the advisor leg; litellm_metadata is forwarded for budget attribution.
Streaming advisor UX (advisor.py, fake_stream_iterator.py): refactors the loop into _run_loop (async generator of semantic events) consumed by _collect (non-streaming) or _stream (SSE); _stream peeks the first executor event before emitting message_start so input_tokens is accurate.
Bug fixes (common_utils.py, fake_stream_iterator.py): _advisor_result_text recovers advice from the native {"type":"advisor_result","text":…} dict shape; build_content_block_chunks now only emits content_block_stop when a content_block_start was emitted.

Confidence Score: 5/5

Safe to merge; all changes are additive to an experimental pass-through path with no impact on existing non-advisor routes.

The orchestration refactor is well-structured and the new _run_loop generator correctly covers both the streaming and non-streaming paths with a single execution trace. The auth check follows the established MCP sampling pattern, the credential-isolation logic is explicitly tested, and the streaming ordering guarantee is verified with an interleaving test. The two findings are narrow in scope: one restores a type filter accidentally dropped during a helper extraction, and the other broadens an exception guard that currently only catches ImportError.

No files require special attention beyond the two inline suggestions.

Important Files Changed

Filename	Overview
litellm/constants.py	Adds ADVISOR_SYSTEM_PROMPT constant — a clean, well-commented addition with no issues.
litellm/llms/anthropic/common_utils.py	Adds _advisor_result_text() helper and refactors strip_advisor_blocks_from_messages to use it; the list-case branch drops the original type=="text" filter, allowing any dict with a "text" field to be selected.
litellm/llms/anthropic/experimental_pass_through/messages/fake_stream_iterator.py	Extracts _sse() and build_content_block_chunks() as module-level helpers; adds server_tool_use and advisor_tool_result block types; fixes the orphan content_block_stop bug for unknown types.
litellm/llms/anthropic/experimental_pass_through/messages/interceptors/advisor.py	Major refactor: introduces _run_loop async generator, _collect/_stream consumers, router routing, advisor model access validation, and usage.iterations[] attribution; logic is sound with one note about _get_llm_router only catching ImportError.
tests/test_litellm/llms/anthropic/chat/test_anthropic_chat_transformation.py	Adds test for the dict-shaped advisor_result content fix; new test is well-targeted and uses mocks only.
tests/test_litellm/llms/anthropic/experimental_pass_through/messages/test_advisor_integration.py	Adds extensive integration tests (router routing, credential forwarding, metadata isolation, streaming ordering, usage attribution, model access auth) — all mocked, good coverage of the new behaviors.
tests/test_litellm/llms/anthropic/experimental_pass_through/messages/test_fake_stream_iterator.py	New test file covering server_tool_use and advisor_tool_result streaming, and the no-orphan-stop fix; well-structured and all mocked.
tests/test_litellm/llms/anthropic/messages/test_advisor_orchestration.py	Renames and strengthens test_loop_streaming_wraps_response: moves iteration inside the patch context (correct for lazy generators) and adds content and message_stop assertions.

Reviews (2): Last reviewed commit: "ci: retrigger checks"

veria-ai · 2026-06-16T18:18:29Z

PR overview

All previously flagged issues have been addressed. No open security concerns remain on this pull request.

Security review

No open security issues remain on this pull request.

Fixed/addressed: 1 · PR risk: 0/10

The spread-merge of usage with iterations produces a plain dict, which mypy rejects against the AnthropicUsage TypedDict. Cast it explicitly.

The advisor_model comes from client-supplied tool input and was routed through llm_router without checking can_key_call_model. A caller whose key only covered the executor model could invoke any router model as advisor. Now validates the advisor model against the caller's UserAPIKeyAuth before entering the orchestration loop. The check only runs inside the proxy (when a router is available); standalone SDK usage has no auth layer and skips it.
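The shape of the guard can be sketched as follows. It does not reproduce `can_key_call_model` or `UserAPIKeyAuth`: a plain set of allowed model names stands in for the real check, and `ensure_advisor_allowed` and `ModelNotAllowed` are hypothetical names.

```python
from typing import Set


class ModelNotAllowed(Exception):
    """Raised when the caller's key does not cover the requested advisor model."""


def ensure_advisor_allowed(advisor_model: str, allowed_models: Set[str], router_available: bool) -> None:
    """Validate the client-supplied advisor model before the loop starts."""
    if not router_available:
        return  # standalone SDK usage: no auth layer, so nothing to check
    if advisor_model not in allowed_models:
        raise ModelNotAllowed(f"key is not permitted to call {advisor_model!r}")


# A key that only covers the executor model must not reach another router model.
try:
    ensure_advisor_allowed("opus-advisor", {"gemini-flash"}, router_available=True)
    rejected = False
except ModelNotAllowed:
    rejected = True
```

Running the check before entering the loop means a rejected request never triggers an executor call either.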

Import _sse from fake_stream_iterator.py instead of defining an identical copy in advisor.py.

The streaming path emitted message_start with input_tokens: 0 before any executor call ran. Anthropic's native SSE puts real input_tokens in message_start. Now peeks the first event from the loop to extract the executor's actual input_tokens before emitting message_start.
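Peeking the first event of an async generator before emitting `message_start` might look like the sketch below. The event shapes and the names `loop_events`, `stream_with_real_start`, and `drain` are assumptions, not the PR's code.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List


async def loop_events() -> AsyncIterator[Dict[str, Any]]:
    """Stand-in for the orchestration loop; the first event carries executor usage."""
    yield {"kind": "executor_usage", "input_tokens": 42}
    yield {"kind": "text", "text": "hello"}


async def stream_with_real_start(events: AsyncIterator[Dict[str, Any]]) -> AsyncIterator[Dict[str, Any]]:
    # Pull one event up front so message_start can report real input_tokens.
    try:
        first = await events.__anext__()
    except StopAsyncIteration:
        first = None
    input_tokens = first.get("input_tokens", 0) if first else 0
    yield {"type": "message_start", "message": {"usage": {"input_tokens": input_tokens, "output_tokens": 0}}}
    # Re-emit the peeked event, then pass the rest through unchanged.
    if first is not None:
        yield first
    async for event in events:
        yield event


async def drain(stream: AsyncIterator[Dict[str, Any]]) -> List[Dict[str, Any]]:
    return [event async for event in stream]


out = asyncio.run(drain(stream_with_real_start(loop_events())))
```

The trade-off is that `message_start` is delayed until the executor's first event is available, in exchange for an accurate token count.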

build_content_block_chunks unconditionally emitted content_block_stop even when no content_block_start was emitted for unrecognized block types, producing an orphan stop event that violates the SSE protocol. Now only emits content_block_stop when a start was emitted.

samagana · 2026-06-16T21:18:26Z

@greptile-apps

Sameerlite · 2026-06-17T03:57:42Z

Thanks for the contribution! One thing to address before we can move forward:

CI is failing — are the failures related to your change? If they're pre-existing or flaky, a quick note would be helpful.

Once those are addressed, we'll take a closer look — thanks again!

samagana added 6 commits June 16, 2026 10:54

greptile-apps Bot reviewed Jun 16, 2026


veria-ai Bot reviewed Jun 16, 2026


Comment thread litellm/llms/anthropic/experimental_pass_through/messages/interceptors/advisor.py

samagana added 9 commits June 16, 2026 11:29

fix(advisor): cast usage dict to AnthropicUsage for mypy

36d8ceb

The spread-merge of usage with iterations produces a plain dict, which mypy rejects against the AnthropicUsage TypedDict. Cast it explicitly.
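For illustration, the pattern looks roughly like this. `UsageSketch` is a hypothetical stand-in for AnthropicUsage with only a few fields, and whether mypy flags the uncast merge can depend on the mypy version.

```python
from typing import Any, Dict, List, TypedDict, cast


class UsageSketch(TypedDict, total=False):
    """Hypothetical, trimmed-down stand-in for AnthropicUsage."""

    input_tokens: int
    output_tokens: int
    iterations: List[Dict[str, Any]]


def add_iterations(usage: UsageSketch, iterations: List[Dict[str, Any]]) -> UsageSketch:
    # The spread-merge builds a plain dict; cast tells the type checker
    # to treat it as the TypedDict. cast() does nothing at runtime.
    merged = {**usage, "iterations": iterations}
    return cast(UsageSketch, merged)


result = add_iterations(
    {"input_tokens": 10, "output_tokens": 2},
    [{"type": "advisor_message", "model": "opus-advisor", "input_tokens": 7, "output_tokens": 3}],
)
```

Since `cast` performs no validation, it is only as safe as the merge it annotates.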

style(advisor): break long cast line for Black

9d51094

refactor(advisor): deduplicate _sse helper

9d9b80d

Import _sse from fake_stream_iterator.py instead of defining an identical copy in advisor.py.

style(advisor): remove extra blank line for Black

a430d36

fix(advisor): remove unused json import

ad9c96f

ci: retrigger checks

4b235f5

samagana wants to merge 15 commits into BerriAI:litellm_internal_staging from samagana:litellm_advisor-proxy-routing-and-rendering

2 participants
