
fix(advisor): route through proxy, render native blocks, and attribute per-model usage #30546

Open
samagana wants to merge 15 commits into BerriAI:litellm_internal_staging from samagana:litellm_advisor-proxy-routing-and-rendering

Conversation

@samagana samagana commented Jun 16, 2026

Contributor

Relevant issues

Upstream advisor rollout: #25516. Ported from scaledata/litellm PRs #4, #7, and #8.

Pre-Submission checklist

  • I have added meaningful tests
  • My PR passes all unit tests on make test-unit (the touched/related suites pass; 32 tests across 4 test files)
  • My PR's scope is as isolated as possible; it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

What and why

Three related fixes to the advisor orchestration loop for non-Anthropic executor providers, combined into a single PR since they form a coherent stack (each builds on the prior). Follow-up commits address all review findings from Greptile and Veria-AI.

1. Route sub-calls through the proxy router (scaledata/litellm#4)

Routes advisor and executor sub-calls through the proxy's llm_router in-process when available (falling back to direct anthropic_messages() for standalone SDK use), instead of requiring ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL pointed at localhost with the master key. Stops passing api_key=None/api_base=None to the advisor sub-call, which would override the router's resolved deployment credentials during kwargs merging. Forwards the caller's litellm_metadata into the advisor sub-call so the advisor model's spend is attributed to the caller's key and budget. Gives the advisor sub-call its own role system prompt so it answers as the advisor rather than adopting the executor's persona.
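
As a rough illustration of the credential and metadata handling described above, here is a minimal sketch. The function name `build_advisor_kwargs` and the prompt text are hypothetical stand-ins, not the code in advisor.py; only the behavior (omit unset credentials, forward `litellm_metadata`, give the advisor its own system prompt) comes from this PR.

```python
from typing import Any, Dict

# Hypothetical placeholder text; the real ADVISOR_SYSTEM_PROMPT lives in litellm/constants.py.
ADVISOR_SYSTEM_PROMPT = "You are the advisor. Answer the question you are asked."


def build_advisor_kwargs(
    advisor_tool: Dict[str, Any], executor_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    """Build kwargs for the advisor sub-call (illustrative sketch only)."""
    kwargs: Dict[str, Any] = {
        "model": advisor_tool["model"],
        # The advisor gets its own role prompt, never the executor's.
        "system": ADVISOR_SYSTEM_PROMPT,
    }
    # Forward only the caller's metadata so spend lands on the caller's key and budget.
    metadata = executor_kwargs.get("litellm_metadata")
    if metadata is not None:
        kwargs["litellm_metadata"] = metadata
    # Pass credentials only when the tool definition sets them explicitly.
    # An explicit None would overwrite the router's resolved credentials on merge.
    for key in ("api_key", "api_base"):
        value = advisor_tool.get(key)
        if value is not None:
            kwargs[key] = value
    return kwargs
```

Because unset credentials are left out of the dict entirely, a later merge such as `{**deployment_params, **advisor_kwargs}` keeps whatever the router resolved.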

2. Render advisor activity in clients (scaledata/litellm#7)

Through the proxy, a non-Anthropic executor with an advisor_20260301 tool was orchestrated entirely in litellm: the executor/advisor loop ran in-process and only the final, flattened executor response was returned. Claude Code drives its advisor UI from server_tool_use (name advisor) and advisor_tool_result content blocks; when the executor is non-Anthropic the server cannot run that loop, so this change has our orchestrator synthesize the same blocks, making the proxy path render identically to the native one. For streaming, the server_tool_use block is flushed before the advisor sub-call is awaited, so the advisor renders as in-progress for its real latency and resolves when the result arrives.
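
For readers unfamiliar with the two block types, the following sketch shows roughly what the synthesized pair could look like. The helper names, the `srvtoolu_` id prefix, and the empty `input` are assumptions for illustration; the block `type` values, the `advisor` name, and the `advisor_result` content shape are the ones named in this PR.

```python
import uuid
from typing import Any, Dict


def make_server_tool_use() -> Dict[str, Any]:
    """Block emitted before the advisor sub-call is awaited (sketch)."""
    return {
        "type": "server_tool_use",
        "id": "srvtoolu_" + uuid.uuid4().hex,  # assumed id format
        "name": "advisor",
        "input": {},  # assumed: the advisor takes no explicit input
    }


def make_advisor_tool_result(tool_use_id: str, advice: str) -> Dict[str, Any]:
    """Block emitted once the advisor sub-call has returned (sketch)."""
    return {
        "type": "advisor_tool_result",
        "tool_use_id": tool_use_id,
        "content": {"type": "advisor_result", "text": advice},
    }
```

Splitting the pair into two helpers mirrors the streaming requirement: the first block can be flushed while the advice is still pending, and the second is tied back to it through `tool_use_id`.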

Also fixes strip_advisor_blocks_from_messages to recover advice text from the native {"type":"advisor_result","text":...} content shape (previously handled only str/list content and silently dropped the dict's text, breaking the multi-turn round-trip).
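
The three content shapes can be handled along these lines. This is a sketch under assumptions, not the `_advisor_result_text` helper from common_utils.py: the function name, the newline join, and the `type == "text"` filter on list items are illustrative choices.

```python
from typing import Any


def advisor_result_text(content: Any) -> str:
    """Recover advice text from str, list, or dict content (illustrative sketch)."""
    if isinstance(content, str):
        return content
    if isinstance(content, dict):
        # Native shape: {"type": "advisor_result", "text": "..."}
        text = content.get("text")
        return text if isinstance(text, str) else ""
    if isinstance(content, list):
        # Keep only text blocks; other dicts that happen to carry "text" are skipped.
        parts = [
            item["text"]
            for item in content
            if isinstance(item, dict)
            and item.get("type") == "text"
            and isinstance(item.get("text"), str)
        ]
        return "\n".join(parts)
    return ""
```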

Also extracts the shared build_content_block_chunks helper in fake_stream_iterator.py and teaches it the server_tool_use and advisor_tool_result block types (previously unknown types were silently dropped from the stream).

3. Attribute advisor spend to its own model (scaledata/litellm#8)

When an advisor consult ran through the proxy, Claude Code's /usage attributed all spend to the executor model and showed nothing for the advisor model. Claude Code computes per-model usage from usage.iterations[] entries of type:"advisor_message". This change emits usage.iterations[] from the orchestrator: one advisor_message entry per advisor sub-call, carrying the advisor model and its token counts, on both the non-streaming response and the streaming message_delta.
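
A minimal sketch of how the iterations array could be assembled. The function name and the input record layout are hypothetical, and the `"message"` type used for the executor's final entry is an assumption; this PR only specifies the `advisor_message` entries and, per the commit message, that the list ends with the executor's final turn.

```python
from typing import Any, Dict, List


def usage_with_iterations(
    executor_usage: Dict[str, Any], advisor_calls: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Return executor usage plus a usage.iterations[] list (illustrative sketch)."""
    iterations: List[Dict[str, Any]] = [
        {
            "type": "advisor_message",
            "model": call["model"],
            "input_tokens": call["input_tokens"],
            "output_tokens": call["output_tokens"],
        }
        for call in advisor_calls
    ]
    # Assumption: the executor's final turn goes last, typed "message".
    iterations.append(
        {
            "type": "message",
            "input_tokens": executor_usage.get("input_tokens", 0),
            "output_tokens": executor_usage.get("output_tokens", 0),
        }
    )
    # The spread-merge yields a plain dict, so a TypedDict return type needs an explicit cast.
    return {**executor_usage, "iterations": iterations}
```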

4. Review feedback fixes

Addresses all automated review findings:

  • Model authorization bypass (Veria-AI, High): validates advisor_model against the caller's UserAPIKeyAuth via can_key_call_model before entering the loop, following the same pattern as the MCP sampling handler. Skipped in standalone SDK mode (no auth layer).
  • Duplicated _sse helper (Greptile): removed the copy from advisor.py, now imports from fake_stream_iterator.py
  • message_start.usage zero tokens (Greptile): the streaming path now peeks the first loop event to extract the executor's real input_tokens before emitting message_start, matching the Anthropic SSE protocol and FakeAnthropicMessagesStreamIterator
  • Orphan content_block_stop (Greptile): build_content_block_chunks now only emits content_block_stop when a content_block_start was emitted, so unknown block types produce no chunks instead of a protocol-violating orphan stop
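
The message_start fix in the list above relies on peeking the first event of an async generator before emitting anything. A minimal sketch of that pattern follows; the event dictionaries and both function names are hypothetical, and the real loop events in advisor.py may be shaped differently.

```python
import asyncio
from typing import Any, AsyncIterator, Dict

Event = Dict[str, Any]


async def fake_loop() -> AsyncIterator[Event]:
    """Hypothetical stand-in for the orchestration loop's event stream."""
    yield {"kind": "executor_usage", "input_tokens": 42}
    yield {"kind": "text", "text": "hello"}


async def stream_with_real_input_tokens(events: AsyncIterator[Event]) -> AsyncIterator[Event]:
    """Peek the first event so message_start carries real input_tokens (sketch)."""
    iterator = events.__aiter__()
    try:
        first = await iterator.__anext__()
    except StopAsyncIteration:
        first = None
    input_tokens = first.get("input_tokens", 0) if first else 0
    yield {
        "type": "message_start",
        "message": {"usage": {"input_tokens": input_tokens, "output_tokens": 0}},
    }
    # Replay the peeked event, then pass the rest through unchanged.
    if first is not None:
        yield first
    async for event in iterator:
        yield event
```

The peeked event is replayed after message_start, so downstream consumers still see every loop event exactly once.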

CI (LiteLLM team)

  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Screenshots / Proof of Fix

Run a proxy with a non-Anthropic executor and an Anthropic advisor, then drive the advisor through the streaming /v1/messages passthrough hitting real providers:

# dev_config.yaml
model_list:
  - model_name: gemini-flash
    litellm_params:
      model: gemini/gemini-2.5-flash
      api_key: os.environ/GEMINI_API_KEY
  - model_name: opus-advisor
    litellm_params:
      model: anthropic/claude-opus-4-8
      api_key: os.environ/ANTHROPIC_API_KEY

python litellm/proxy/proxy_cli.py --config litellm/proxy/dev_config.yaml --detailed_debug --reload --use_v2_migration_resolver 2>&1 | tee litellm.log

curl -N http://localhost:4000/v1/messages \
  -H "content-type: application/json" -H "x-api-key: $LITELLM_MASTER_KEY" \
  -H "anthropic-beta: advisor-tool-2026-03-01" \
  -d '{"model":"gemini-flash","max_tokens":1024,"stream":true,
       "tools":[{"type":"advisor_20260301","name":"advisor","model":"opus-advisor"}],
       "messages":[{"role":"user","content":"Before answering, consult your advisor: cleanest way to debounce in React? Then summarize."}]}'

Expected SSE order: content_block_start for server_tool_use (name advisor), a pause for the advisor latency, then content_block_start for advisor_tool_result, then the executor text. The message_start carries real input_tokens from the executor, and the message_delta carries usage.iterations[] with the advisor model's token counts. In Claude Code pointed at the proxy with /advisor opus-advisor, the advisor dot shows in-progress then resolves, and /usage shows a separate cost line for the advisor model.
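
To check the block order without reading raw SSE by eye, something like the following could be run over the saved curl output. It is a small sketch, not part of this PR; the function name is hypothetical, and it assumes each `data:` line holds one JSON event with a `content_block` object on `content_block_start`.

```python
import json
from typing import List


def content_block_order(sse_text: str) -> List[str]:
    """Return the content block types in the order their start events appear."""
    order: List[str] = []
    for line in sse_text.splitlines():
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore keep-alives or non-JSON payloads
        if isinstance(event, dict) and event.get("type") == "content_block_start":
            order.append(event.get("content_block", {}).get("type", "unknown"))
    return order
```

For the request above, the expected result would begin with `server_tool_use`, followed by `advisor_tool_result`, followed by `text`.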

Type

🐛 Bug Fix

Changes

See the "What and why" section above

samagana added 6 commits June 16, 2026 10:54
…ng env vars

The advisor orchestration handler called anthropic_messages() directly for
sub-calls, which required ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL to be set.
In a proxy setup, the workaround was pointing those env vars at localhost with
the master key, creating a pointless HTTP round-trip.

Now _call_messages_handler checks for the proxy's llm_router and routes through
it when available. The router resolves model deployments and credentials from
the proxy config in-process. Falls back to the direct anthropic_messages() call
for standalone SDK usage (no proxy).
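
The dispatch described here amounts to a simple "router if present, else direct" branch. A sketch with hypothetical stand-ins follows: `FakeRouter`, its `messages` method, and `direct_messages` are illustrative names, not litellm APIs.

```python
import asyncio
from typing import Any, Dict, Optional


class FakeRouter:
    """Hypothetical stand-in for the proxy's in-process router."""

    async def messages(self, **kwargs: Any) -> Dict[str, Any]:
        return {"via": "router", "model": kwargs["model"]}


async def direct_messages(**kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for the direct SDK call used without a proxy."""
    return {"via": "direct", "model": kwargs["model"]}


async def call_messages_handler(router: Optional[FakeRouter], **kwargs: Any) -> Dict[str, Any]:
    # Inside the proxy a router exists, so deployments and credentials resolve in-process.
    if router is not None:
        return await router.messages(**kwargs)
    # Standalone SDK use: no router, fall back to the direct call.
    return await direct_messages(**kwargs)
```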

Also fixes a bug where api_key=None / api_base=None were always passed to the
advisor sub-call, which would override the router's deployment credentials
during kwargs merging. Now only passes them when explicitly set in the advisor
tool definition.

The advisor sub-call routed through the proxy router but carried none of the
caller's litellm_metadata (user_api_key/team/budget/session), so the advisor
model's spend was never attributed to the caller's key or budget and the call did
not group under the session in the UI. Forward litellm_metadata into the advisor
leg so it is tracked exactly like the executor leg.

Forward only litellm_metadata, not the executor's generation params: feeding the
advisor the executor's agent system prompt (and tool_choice) made it mimic the
executor and echo the advisor call instead of answering the question.
api_key/api_base still come from the advisor tool definition only.

Regression test asserts the advisor leg receives litellm_metadata but not the
executor's system/tool_choice.

In our orchestration the advisor is handed a plain /v1/messages request rather
than Anthropic's native server-side framing. With no role of its own it adopted
the executor's persona from the forwarded conversation and refused or punted
("there's no separate advisor I can query, I'm answering directly"), which read
to the executor as the advisor kicking the task back.

Add ADVISOR_SYSTEM_PROMPT and pass it as the advisor leg's system prompt so the
advisor answers as the advisor. The executor's own system prompt is still not
forwarded, and the forwarded conversation context is unchanged; this only adds
the role. A regression test asserts the advisor leg carries the advisor role
prompt and not the executor's.

The rebase conflict resolution accidentally stripped comments from
upstream's test_named_params_forwarded_into_advisor_executor_subcall
and test_pre_request_hook_override_does_not_collide_with_explicit_kwargs.
Restoring them so the PR diff only adds new tests without modifying
upstream code.

…dvisor blocks

The non-native advisor orchestrator flattened the executor/advisor loop into a
single final response, so Claude Code (and any client) saw a plain message with
no signal that an advisor was consulted. Surface each advisor exchange as the
native server_tool_use (name "advisor") and advisor_tool_result blocks that
clients key their advisor UI on.

Streaming now runs a real orchestrator: the server_tool_use block is flushed
before the advisor sub-call is awaited, so the advisor renders as in-progress
for its real latency and resolves when the result arrives, matching the native
server-side experience. Non-streaming and streaming share one _run_loop
generator so the two paths cannot diverge.
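
The shape of that design can be sketched as one async generator with two consumers. Everything here is a simplified assumption (names, event shapes, and the absence of deltas); it only illustrates why the server_tool_use block reaches the client before the advisor call starts.

```python
import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable, Dict

Block = Dict[str, Any]


async def run_loop(consult_advisor: Callable[[], Awaitable[str]]) -> AsyncIterator[Block]:
    """Single source of truth for both paths (illustrative sketch)."""
    # Yielded first, so a streaming consumer can flush it before the await below.
    yield {"type": "server_tool_use", "name": "advisor", "input": {}}
    advice = await consult_advisor()
    yield {"type": "advisor_tool_result", "content": {"type": "advisor_result", "text": advice}}
    yield {"type": "text", "text": "Summary: " + advice}


async def collect(blocks: AsyncIterator[Block]) -> Dict[str, Any]:
    """Non-streaming consumer: drain the loop into one response."""
    return {"role": "assistant", "content": [block async for block in blocks]}


async def stream(blocks: AsyncIterator[Block]) -> AsyncIterator[Dict[str, Any]]:
    """Streaming consumer: forward each block as soon as the loop yields it."""
    index = 0
    async for block in blocks:
        yield {"type": "content_block_start", "index": index, "content_block": block}
        yield {"type": "content_block_stop", "index": index}
        index += 1
```

Since both consumers iterate the same generator, any block the loop produces appears in both the collected response and the stream, in the same order.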

Also fix strip_advisor_blocks_from_messages to recover advice text from the
native {"type":"advisor_result","text":...} content shape; it previously
handled only str/list content and silently dropped the dict's text, which
broke the multi-turn round-trip once we started emitting these blocks.

…ations[]

Clients such as Claude Code compute per-model usage from two sources: the
top-level usage is attributed to the executor model, and each usage.iterations[]
entry of type "advisor_message" is attributed to that entry's own model. Our
orchestrator returned only the final executor usage with no iterations array, so
the advisor model's tokens folded into the executor's line and its cost showed
as zero.

Emit usage.iterations[] from the orchestrator: one advisor_message entry per
advisor sub-call carrying the advisor model and its token counts, on both the
non-streaming response and the streaming message_delta. The list ends with the
executor's final turn so the client's context-window readout, which reads the
last iteration, stays correct.

codecov Bot commented Jun 16, 2026

Codecov Report

❌ Patch coverage is 88.46154% with 18 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ntal_pass_through/messages/fake_stream_iterator.py | 70.37% | 8 Missing ⚠️ |
| ...ntal_pass_through/messages/interceptors/advisor.py | 93.91% | 7 Missing ⚠️ |
| litellm/llms/anthropic/common_utils.py | 66.66% | 3 Missing ⚠️ |

greptile-apps Bot commented Jun 16, 2026

Contributor

Greptile Summary

This PR delivers three coherent fixes to the advisor orchestration loop for non-Anthropic executor providers: routing sub-calls through the proxy's llm_router, surfacing advisor activity as native server_tool_use/advisor_tool_result SSE blocks so clients render the in-progress state, and emitting usage.iterations[] so Claude Code attributes advisor spend to the correct model. A fourth commit adds model-access validation via can_key_call_model and closes several review findings from prior rounds.

  • Router routing & credential isolation (advisor.py): sub-calls now go through llm_router when available; advisor_kwargs is built explicitly to avoid leaking None credentials or the executor's system prompt/tool_choice into the advisor leg; litellm_metadata is forwarded for budget attribution.
  • Streaming advisor UX (advisor.py, fake_stream_iterator.py): refactors the loop into _run_loop (async generator of semantic events) consumed by _collect (non-streaming) or _stream (SSE); _stream peeks the first executor event before emitting message_start so input_tokens is accurate.
  • Bug fixes (common_utils.py, fake_stream_iterator.py): _advisor_result_text recovers advice from the native {"type":"advisor_result","text":…} dict shape; build_content_block_chunks now only emits content_block_stop when a content_block_start was emitted.

Confidence Score: 5/5

Safe to merge; all changes are additive to an experimental pass-through path with no impact on existing non-advisor routes.

The orchestration refactor is well-structured and the new _run_loop generator correctly covers both the streaming and non-streaming paths with a single execution trace. The auth check follows the established MCP sampling pattern, the credential-isolation logic is explicitly tested, and the streaming ordering guarantee is verified with an interleaving test. The two findings are narrow in scope: one restores a type filter accidentally dropped during a helper extraction, and the other broadens an exception guard that currently only catches ImportError.

No files require special attention beyond the two inline suggestions.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/constants.py | Adds ADVISOR_SYSTEM_PROMPT constant; a clean, well-commented addition with no issues. |
| litellm/llms/anthropic/common_utils.py | Adds _advisor_result_text() helper and refactors strip_advisor_blocks_from_messages to use it; the list-case branch drops the original type=="text" filter, allowing any dict with a "text" field to be selected. |
| litellm/llms/anthropic/experimental_pass_through/messages/fake_stream_iterator.py | Extracts _sse() and build_content_block_chunks() as module-level helpers; adds server_tool_use and advisor_tool_result block types; fixes the orphan content_block_stop bug for unknown types. |
| litellm/llms/anthropic/experimental_pass_through/messages/interceptors/advisor.py | Major refactor: introduces _run_loop async generator, _collect/_stream consumers, router routing, advisor model access validation, and usage.iterations[] attribution; logic is sound with one note about _get_llm_router only catching ImportError. |
| tests/test_litellm/llms/anthropic/chat/test_anthropic_chat_transformation.py | Adds test for the dict-shaped advisor_result content fix; new test is well-targeted and uses mocks only. |
| tests/test_litellm/llms/anthropic/experimental_pass_through/messages/test_advisor_integration.py | Adds extensive integration tests (router routing, credential forwarding, metadata isolation, streaming ordering, usage attribution, model access auth), all mocked, with good coverage of the new behaviors. |
| tests/test_litellm/llms/anthropic/experimental_pass_through/messages/test_fake_stream_iterator.py | New test file covering server_tool_use and advisor_tool_result streaming, and the no-orphan-stop fix; well-structured and all mocked. |
| tests/test_litellm/llms/anthropic/messages/test_advisor_orchestration.py | Renames and strengthens test_loop_streaming_wraps_response: moves iteration inside the patch context (correct for lazy generators) and adds content and message_stop assertions. |

Reviews (2): Last reviewed commit: "ci: retrigger checks"

Comment thread litellm/llms/anthropic/experimental_pass_through/messages/interceptors/advisor.py Outdated
Comment thread litellm/llms/anthropic/experimental_pass_through/messages/fake_stream_iterator.py Outdated

veria-ai Bot commented Jun 16, 2026

Contributor

PR overview

All previously flagged issues have been addressed. No open security concerns remain on this pull request.

Security review

No open security issues remain on this pull request.

Fixed/addressed: 1 · PR risk: 0/10

samagana added 9 commits June 16, 2026 11:29
The spread-merge of usage with iterations produces a plain dict, which
mypy rejects against the AnthropicUsage TypedDict. Cast it explicitly.

The advisor_model comes from client-supplied tool input and was routed
through llm_router without checking can_key_call_model. A caller whose
key only covered the executor model could invoke any router model as
advisor. Now validates the advisor model against the caller's
UserAPIKeyAuth before entering the orchestration loop. The check only
runs inside the proxy (when a router is available); standalone SDK
usage has no auth layer and skips it.

Import _sse from fake_stream_iterator.py instead of defining an
identical copy in advisor.py.

The streaming path emitted message_start with input_tokens: 0 before
any executor call ran. Anthropic's native SSE puts real input_tokens
in message_start. Now peeks the first event from the loop to extract
the executor's actual input_tokens before emitting message_start.

build_content_block_chunks unconditionally emitted content_block_stop
even when no content_block_start was emitted for unrecognized block
types, producing an orphan stop event that violates the SSE protocol.
Now only emits content_block_stop when a start was emitted.
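
The guard can be pictured as follows. This is a simplified sketch under assumptions, not the helper in fake_stream_iterator.py: the function name is hypothetical, chunks are plain dicts rather than SSE bytes, and the server-tool blocks are assumed to be sent whole in content_block_start.

```python
from typing import Any, Dict, List

KNOWN_BLOCK_TYPES = ("text", "server_tool_use", "advisor_tool_result")


def build_chunks_sketch(index: int, block: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Emit start/delta/stop chunks for one block; nothing for unknown types."""
    block_type = block.get("type")
    if block_type not in KNOWN_BLOCK_TYPES:
        # No start was emitted, so no stop may be emitted either.
        return []
    chunks: List[Dict[str, Any]] = []
    if block_type == "text":
        chunks.append(
            {"type": "content_block_start", "index": index, "content_block": {"type": "text", "text": ""}}
        )
        chunks.append(
            {
                "type": "content_block_delta",
                "index": index,
                "delta": {"type": "text_delta", "text": block.get("text", "")},
            }
        )
    else:
        # Assumption: server_tool_use and advisor_tool_result travel whole in the start event.
        chunks.append({"type": "content_block_start", "index": index, "content_block": dict(block)})
    chunks.append({"type": "content_block_stop", "index": index})
    return chunks
```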
@samagana

Contributor Author

@greptile-apps

@Sameerlite

Collaborator

Thanks for the contribution! One thing to address before we can move forward:

  • CI is failing — are the failures related to your change? If they're pre-existing or flaky, a quick note would be helpful.

Once those are addressed, we'll take a closer look — thanks again!
